1 Introduction

Alzheimer’s disease (AD) is the most prevalent form of dementia among the elderly, accounting for approximately 80% of all dementia cases and ranking as the sixth leading cause of death in the USA. In 2006, it was estimated that 26.6 million people worldwide were experiencing various degrees of dementia, with a projected 100 million affected individuals expected by 2050 [5]. Regrettably, as of now, there have been no drug treatments identified that can reduce the risk of developing AD or slow its progression.

The Alzheimer’s Disease Neuroimaging Initiative (ADNI), initiated in 2004, plays a vital role in advancing the development of biomarkers for early detection (diagnosis) and monitoring (prognosis) of AD. This is achieved through the use of longitudinal clinical, imaging, genetic, and biochemical data obtained from patients with AD, patients with mild cognitive impairment (MCI), and healthy controls (HC). The main challenge in this project is predicting and classifying each subject’s status as one of these three labels. ADNI has enabled various discoveries associated with its biomarkers. For example, the findings of Meyer et al. [25] suggest that while CSF and PET tau measurements are frequently consistent, they may indicate distinct phases of tau pathological progression. Weiner et al. [43] demonstrate that abnormalities in cerebrospinal fluid tau may be detected prior to observable cognitive decline, which is earlier in the pathogenetic process of Alzheimer’s disease than a positive flortaucipir-PET result.

ADNI data exhibit several peculiarities: multimodal and heterogeneous data, missing values (MV), high-dimensional data, and historical data. These features pose significant challenges when analyzing the data, and a substantial number of studies in the literature demonstrate that traditional computational data analysis techniques are not always suitable.

Pattern recognition and Machine Learning (ML) methods have played a crucial role in the automatic identification of Alzheimer’s disease. Tasks like classification, regression, feature extraction and selection, multimodal data fusion, and dimensionality reduction are central to this ongoing multidisciplinary research effort. Nevertheless, pattern analysis is impeded by the presence of MV within the ADNI dataset: some patients have incomplete records, and data modalities may be partially or entirely absent for reasons such as high measurement costs, equipment malfunctions, suboptimal data quality, patients missing appointments or discontinuing their participation in the study, and reluctance to undergo invasive procedures.

The MV issue can be addressed in two ways [19]. First, all samples with missing records can be excluded before any analysis is conducted. This is a valid approach when the percentage of excluded samples is small enough to avoid biasing the study. Second, the missing values can be estimated from the partially measured data. This method is known as imputation [19] and is recommended when the data analysis techniques employed are not designed to handle missing entries. Although ADNI is considered a study with a large amount of data, approximately 78% of the patients have incomplete records. Removing these records would significantly reduce the data, which could lead to issues of replicability [27] and representativeness of the study [33]. Therefore, the use of missing value imputation methods is essential to uncover meaningful clinical patterns. When dealing with MV in the field of ML, there are three scenarios: (1) the training data is complete and the test data contains missing values [31], (2) the training data contains missing values and the test data is complete [13], and (3) both the training and test data contain missing values [14, 22].

Initially, data analysis in ADNI avoided addressing the problem of MV, but the community discovered that performing imputation as a preprocessing step before prediction led to improved final results. Examples can be found in [8, 18, 44].

This work addresses the imputation problem in ADNI to enhance the automatic classification of subjects. Its primary contribution is a multimodal imputation algorithm based on the EM algorithm that creates a model capable of imputing future observations using training data, thereby enhancing the EM algorithm for imputation and allowing its use in a supervised environment. This new algorithm: (1) performs data imputation in a supervised learning environment, that is, the algorithm creates an imputation model during the training phase that enables it to impute future data, so that prediction can afterwards be carried out with any desired classification method; (2) improves the classification performance of existing EM-based techniques for data imputation; (3) creates neighborhoods of subjects in a missing values environment, which can be useful for tasks related to unsupervised learning, such as clustering.

Organization of the paper: In Sect. 2, we present the state of the art of imputation methods and of automatic classification in ADNI. In Sect. 3, we explain the theoretical framework underlying the EM algorithm, which serves as the foundation for our proposal. We then detail the proposal in Sect. 4. In Sect. 5, we describe the experimental setup used to evaluate the proposed method, present the results, and explain our findings. Finally, we present conclusions and future work in Sect. 6.

2 State of the art

2.1 Imputation methods

The MV imputation task for Alzheimer’s disease diagnosis is a widely studied problem in the literature. Several machine learning methods for imputation and automatic disease identification have been jointly employed acknowledging that: a more accurate imputation does not always have a larger effect on the classification performance [2]; more complex imputation methods do not always offer significant advantages over more efficient alternatives such as mean imputation [24]; and finally, the performance of classification methods depends on the MV handling strategy, number of features, percentage of missing values and number of target classes [2].

In the literature, there are basic methods such as the use of the mean, the median, and the Winsorized mean for cases where outliers are detected in the data. Hot deck and Cold deck [19] are two simple imputation methods. The first one replaces missing values with data from the most similar example to the one being imputed, based on a similarity or distance function between the observed variables. Cold deck follows the same idea as Hot deck, but it takes the replacement values from an external source, such as a previous realization of the same study, rather than from the dataset being imputed.
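As an illustration of the hot deck idea, the following is a minimal sketch, not the procedure of any cited work. It assumes NumPy, encodes missing entries as NaN, uses the Euclidean distance on the observed variables, and only accepts fully observed rows as donors; the function name `hot_deck_impute` is hypothetical.

```python
import numpy as np

def hot_deck_impute(X):
    """Fill each incomplete row with values from its nearest complete donor row."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]
    if donors.shape[0] == 0:
        raise ValueError("hot deck needs at least one complete row")
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        # Distance to every donor, computed on the observed variables only.
        d = np.sqrt(((donors[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        donor = donors[np.argmin(d)]
        out[i, ~obs] = donor[~obs]
    return out

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [9.0, 8.0, 7.0]])
X_filled = hot_deck_impute(X)  # row 1 borrows its third value from row 0
```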

Among the more sophisticated methods are algorithms that perform iterative imputations of missing values. One example is Multiple Imputation [19]. This was one of the first methods in which the final estimate of the missing values is obtained by combining several imputations. The algorithm provides accurate estimates of quantities or associations of interest, such as treatment effects in randomized trials, sample means of specific variables, correlations between two or more variables, as well as the related variances. Another example is MICE [29], which performs an initial imputation using a simple and quick method, such as the mean. Subsequently, for each imputed data point, regression models are employed to update the previous imputations. This process allows values to be updated over as many iterations as one chooses to perform.
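The chained update used by MICE can be sketched as follows. This is a simplified, deterministic illustration under stated assumptions: NumPy, NaN-encoded missing values, ordinary least squares as the regression model, and a single chain. MICE as described in [29] additionally draws imputations from a predictive distribution and produces several completed datasets; the function name `chained_regression_impute` is hypothetical.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Mean-initialize, then repeatedly regress each incomplete column on the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    # Initial imputation: each missing entry gets the mean of its column.
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            train = ~missing[:, j]  # rows where column j was actually observed
            coef, *_ = np.linalg.lstsq(A[train], filled[train, j], rcond=None)
            # Update only the entries that were missing in the original data.
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
X_filled = chained_regression_impute(X)
```

In this toy dataset the second column is exactly twice the first, so the regression recovers the missing entry from the observed rows.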

Matrix completion is the task of filling in the MV of a partially observed matrix, which is equivalent to performing data imputation in statistics. In [39], imputation for microarray data is performed using Singular Value Decomposition (SVD), based on the decomposition \(X = U\Sigma V^{t}\). This method, assisted by eigenvectors and eigenvalues, performs data imputation using a linear regression model iteratively. Since SVD can only be used on complete matrices, it initially proceeds to impute the data using the mean method. Following the idea of SVD, there is the Singular Value Thresholding (SVT) algorithm [7], which is an algorithm that minimizes the nuclear norm (also known as the trace norm) of a matrix, subject to certain types of constraints. Generally, these algorithms do not create a model to enable imputation of future data. This creates a disadvantage when working in a supervised learning environment.
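A minimal sketch of SVD-based matrix completion is given below. It is a simplified variant, not the regression-on-eigenvectors procedure of [39] nor the SVT algorithm of [7]: it assumes NumPy, NaN-encoded missing values, a user-chosen target rank, and it alternates between a truncated SVD and re-filling the missing entries. The name `svd_impute` and its parameters are hypothetical.

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=100, tol=1e-8):
    """Iteratively replace missing entries with a rank-`rank` SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # SVD needs a complete matrix, so start from column-mean imputation.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries are kept; only missing entries are updated.
        new = np.where(missing, low_rank, X)
        converged = np.linalg.norm(new - filled) < tol
        filled = new
        if converged:
            break
    return filled

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan]])
X_filled = svd_impute(X, rank=1)
```

Note that the function works on one matrix at a time and keeps no reusable model, which is the limitation for supervised settings mentioned above.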

Machine learning methods have also been used for imputing missing values. These methods create an imputation model using training data that allows for estimating test data (future data). One of the simplest and most commonly used is K-nearest neighbors (K-NN) imputation. This method imputes missing values by leveraging the K most similar examples according to a distance metric. Once the K nearest examples are found, it is possible to use the mode for discrete data, the mean for continuous data, or another measure for imputation [4]. MissForest [35] is a Random Forest (RF) based algorithm that imputes both numerical and categorical data. It first performs an initial imputation using the mean for numerical data or the mode for categorical data. The variable to be imputed is treated as the dependent variable, while the others become the independent variables. Then a random forest is fitted and the imputed values are updated repeatedly until they converge. Algorithms like Support Vector Regression (SVR) [11] and Multilayer Perceptron [6] have also been used for imputing missing values. Considering the variables to be imputed as dependent variables and the observed variables as independent variables, they create non-linear regression models for each missing value pattern in the training data.

Imputation of data through clustering has also been used in the literature. The main idea is to perform data imputation with the data points belonging to the same cluster. Examples of this can be found in [26, 48]. In [47], a Multi-view Learning (MVL) approach is employed for data imputation from various sources. In this work, the authors create a common feature subspace to bring the data together so that data from the same class are as close as possible, while data from different classes are separated as much as possible. Then, data imputation is performed using the matrix factorization concept. Given the matrix with missing values, denoted as \(X_{M}\), it is proposed that it is possible to obtain \(X_{M} = L_{M} + S_{M}\), where \(L_{M}\) is a low-rank matrix representing the inter-class differences, and \(S_{M}\) is a sparse matrix representing the intra-class differences.

In recent years, approaches based on Deep Learning (DL) have been used for data imputation. In most of the works related to data imputation and deep learning, models based on Autoencoders (AE) are the most popular. Examples of this are [1, 23]. There are different variants among the models based on AE, including Denoising Autoencoders (DAE) and Variational Autoencoders (VAE). In [15], a DAE is used for data imputation. First, an initial imputation is performed using simple methods such as the mean or mode. Then, the DAE is applied multiple times with different weight initializations of the network, with the aim of following a multiple imputation strategy like other classical methods. In [28], the authors survey data imputation works from 2014 to 2020 that use AE-based models. They summarize aspects such as the loss function, network architecture, and parameter tuning of the reviewed works. The best results are achieved by DAE and VAE when compared to classical methods such as KNN, MICE, SVD, and the mean.
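To make the K-NN imputation step concrete, here is a minimal sketch under stated assumptions: NumPy, a complete training matrix, continuous variables (so the neighbor mean is used), and the Euclidean distance on the observed variables of the new example. The function name `knn_impute` is hypothetical.

```python
import numpy as np

def knn_impute(X_train, x_new, k=2):
    """Impute the missing entries of x_new with the mean of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    obs = ~np.isnan(x_new)
    # Distances use only the variables observed in x_new.
    d = np.sqrt(((X_train[:, obs] - x_new[obs]) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    out = x_new.copy()
    out[~obs] = X_train[nearest][:, ~obs].mean(axis=0)
    return out

X_train = np.array([[1.0, 1.0, 10.0],
                    [1.2, 0.9, 12.0],
                    [8.0, 9.0, 50.0]])
x_filled = knn_impute(X_train, np.array([1.1, 1.0, np.nan]), k=2)
```

Because the training matrix is stored and reused, the same function can impute future examples, which is what distinguishes this family from the matrix completion methods.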

Not all imputation methods can work in a predictive manner, that is, construct a model that allows imputing future data. Methods related to matrix factorization approaches, such as SVD and SVT, are clear examples. Most methods are unsupervised, but they may become supervised through certain modifications. For instance, during the training phase the mean imputation method could be considered supervised if the imputation of a missing value is based on the class to which the example belongs. In [19], this is referred to as conditional mean imputation. In a supervised learning scenario where we have labels for the data, imputation methods may or may not use this information to perform their task. The main issue with using the labels is that when we want to impute future missing values, the labels are not available to guide the imputation process, and this problem is not easy to solve.

2.2 Automatic classification on ADNI

Initially, the ADNI project consisted of 800 subjects, approximately 78% of whom exhibit some degree of missing values. The data come from different sources of information (biochemical, genetic, imaging, etc.), so the phenomenon of MV occurs at the source level. Since each source provides several variables to the study, MV affect an entire block of variables, which turns ADNI into a block-wise data imputation problem.

The subjects are classified into 3 main groups: Alzheimer’s Disease (AD), Mild Cognitive Impairment (MCI), and Healthy Control (HC). MCI can be divided into 2 subgroups: sMCI (stable MCI) and pMCI (progressive MCI). sMCI refers to patients whose condition does not progress to AD, meaning the disease remains stable, while pMCI refers to patients whose condition progresses to AD, indicating an advancement of the disease.

The ADNI project, up to the present day [43], is composed of 4 major stages: ADNI, ADNI-GO, ADNI-2, and ADNI-3 [42]. In each stage, new sources of information are added, as well as more examples, including both new subjects and those who have participated previously. This allows for longitudinal studies aiming to investigate the progression of the disease pattern.

Figure 1 illustrates the situation in ADNI-1 considered in this work.

Fig. 1 ADNI1 has missing values in blocks. Each block represents a source of information. The question marks (?) represent missing values

The works [16, 17] address the data fusion problem, that is, how to use information from different sources to improve classification. Zhang and Shen [45, 46] use three different data sources and additionally perform a feature ranking, which provides information about which features are more important in terms of their discriminatory power between healthy and diseased subjects.

Yuan et al. [44] address the problem of data fusion and MV. They work with four data sources and 780 subjects. MV imputation is performed using four classical methods and one proposed, and classification is carried out using SVM. The problem of multimodality fusion with MV can be found in works such as [20, 38, 49]. Current research on imputing MV in ADNI remains of interest to the community. Examples of these include [2, 3, 36].

3 Preliminaries

A dataset is represented by a matrix \(X_{N\times D}\) composed of N examples and D variables or features. An example i in a classification problem is represented by \(x_{i:}=[x_{i1},x_{i2},\ldots ,x_{ij},\ldots ,x_{iD}]\), where \(i=1\ldots N\) and \(j=1\ldots D\) indexes the jth variable. Furthermore, each ith vector has a value \(y_{l} \in {\textbf {Y}} = \{y_{1},y_{2},\ldots ,y_{l},\ldots ,y_{L}\}\), which indicates the lth class to which it belongs from the set \({\textbf {Y}}\).

When dealing with MV, a dataset is represented by Eq. (1)

$$\begin{aligned} D=\{X_{N\times D},Y_{N\times 1},M_{N\times D}\} \end{aligned}$$
(1)

where \(x_{i:}\) is the ith input vector, Y is the class vector, and \(m_{i:} = [m_{i1},m_{i2},\ldots m_{iD}] \in \{0,1\}^{D}\) indicates which variables in vector i are unknown (one value) and which are observed (zero value). X, in the case of MV, is divided into \(X = \{X_{o},X_{m}\}\) where \(X_{o}\) represents the observable data, and \(X_{m}\) represents the MV.
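As a small illustration of this notation, the indicator matrix M can be built directly from a data matrix in which missing entries are encoded as NaN, and examples can then be grouped by missing value pattern. This is only a sketch assuming NumPy; the toy values and the variable names are hypothetical.

```python
import numpy as np

# Toy data matrix X with N = 4 examples and D = 3 variables; NaN marks a missing value.
X = np.array([[0.5, 1.2, np.nan],
              [0.7, np.nan, np.nan],
              [0.1, 0.9, 2.0],
              [0.3, 1.1, np.nan]])

# Indicator matrix M: 1 where the value is unknown, 0 where it is observed.
M = np.isnan(X).astype(int)

# Group example indices by their missing value pattern m_i.
patterns = {}
for i, row in enumerate(M):
    patterns.setdefault(tuple(row.tolist()), []).append(i)
```

Here `patterns` maps each distinct row of M to the examples sharing it, which is the grouping that pattern-specific imputation models rely on.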

Little and Rubin [19] define three MV mechanisms: (i) missing completely at random (MCAR): missing values are independent of both observed and unobserved data. MCAR can be expressed by \(p[M\mid X_{o},X_{m},\xi ] = p[M\mid \xi ]\), where \(\xi \) is a vector of unknown parameters. (ii) Missing at Random (MAR): given the observed data, missing values are independent of unobserved data. MAR can be expressed by \(p[M\mid X_{o},X_{m},\xi ] = p[M\mid X_{o}, \xi ]\). (iii) Missing Not at Random (MNAR): missing values depend on the unobserved data. When the data is MCAR or MAR, the MV mechanism can be ignored, simplifying the methods used for MV analysis. For this reason, most studies assume that the MV mechanisms are MCAR or MAR and, for our work, we will assume that the MV mechanism is MAR.

Within the state of the art of the MV problem, there is an algorithm that makes assumptions about the joint distribution of the model’s variables. This algorithm is called Expectation Maximization (EM), proposed by Dempster [10]. The EM algorithm is a general iterative method for estimating distributional parameters by maximizing the likelihood. It is typically used when some of the samples of the random variables are unobserved. Each iteration of the method performs two steps: the E (Expectation) step and the M (Maximization) step. The E step computes the expected value of the log-likelihood over the missing values, given the observations and the current parameter estimate. Then, the M step maximizes this expected log-likelihood with respect to the parameters. Formally, the steps can be expressed by Eq. (2).

$$\begin{aligned} \begin{aligned} Step\ E: Q(\theta \mid \theta _{t}) = E_{z \mid X, \theta _{t}}[l(\theta \vert X, z)]\\ Step\ M: \theta _{t+1} = \underset{\theta }{arg\ \max }\ \ Q(\theta \mid \theta _{t}) \end{aligned} \end{aligned}$$
(2)

where \(\theta _{t}\) is the parameter vector in iteration t, X is the matrix containing all the observations, \(l(\cdot )\) is the log-likelihood function, and z is the vector representing the MV.

Based on the EM algorithm, Schneider [34] proposes a new version that adds an additional step in each iteration. After the E and M steps, MV imputation is performed with a regression model in each iteration, where the initial imputation is done using the mean. As a parametric method, it assumes that the MV are MAR and that the data come from a Gaussian distribution. The imputation of the missing data in the ith vector is provided by a linear regression model, as shown in Eq. (3), which relates variables with MV to variables with complete data.

$$\begin{aligned} {\textbf {x}}_{im} = \mu _{m} + ({\textbf {x}}_{io} - \mu _{o})\beta + {\textbf {e}} \end{aligned}$$
(3)

In Eq. (3), \({\textbf {x}}_{io} \in \mathbb {R}^{1\times p_{o}}\) is the sub-vector of the \(p_{o}\) variables with observed data, \({\textbf {x}}_{im} \in \mathbb {R}^{1\times p_{m}}\) is the sub-vector of the \(p_{m}\) variables with missing data, \(\mu _{o} \in \mathbb {R}^{1\times p_{o}}\) is the mean vector of the variables with observed data, and \(\mu _{m} \in \mathbb {R}^{1\times p_{m}}\) is the mean vector of the variables with missing data. The matrix \(\beta \in \mathbb {R}^{p_{o}\times p_{m}}\) is a regression coefficient matrix, and the residual \({\textbf {e}} \in \mathbb {R}^{1\times p_{m}}\) is a random vector with a mean of zero and an unknown covariance matrix \(COV \in \mathbb {R}^{p_{m}\times p_{m}}\). The regression coefficient matrix \(\beta \) and the covariance matrix COV are estimated according to Eq. (4).

$$\begin{aligned} \begin{aligned} \widehat{\beta } = \widehat{\Sigma }^{-1}_{oo}\widehat{\Sigma }_{om}\\ \widehat{COV} = \widehat{\Sigma }_{mm} - \widehat{\Sigma }_{mo}\widehat{\Sigma }_{oo}^{-1}\widehat{\Sigma }_{om} \end{aligned} \end{aligned}$$
(4)

In Eq. (4), \(\widehat{\Sigma }_{mm}\) is the estimated covariance matrix of the variables with MV, \(\widehat{\Sigma }_{oo}\) is the estimated covariance matrix of the observed variables, and \(\widehat{\Sigma }_{om} = \widehat{\Sigma }_{mo}^{T}\) is the estimated cross-covariance matrix between the observed variables and the variables with MV.
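Equations (3) and (4) can be sketched for a single example as follows. The sketch assumes NumPy, that estimates of the mean vector and the full covariance matrix are already available, and that the residual term is dropped, so the imputed value is the conditional mean. The function names are hypothetical.

```python
import numpy as np

def regression_params(Sigma, obs):
    """Eq. (4): beta = Sigma_oo^{-1} Sigma_om and COV = Sigma_mm - Sigma_mo Sigma_oo^{-1} Sigma_om."""
    obs = np.asarray(obs, dtype=bool)
    mis = ~obs
    S_oo = Sigma[np.ix_(obs, obs)]
    S_om = Sigma[np.ix_(obs, mis)]
    S_mm = Sigma[np.ix_(mis, mis)]
    beta = np.linalg.solve(S_oo, S_om)  # avoids forming the explicit inverse
    cov = S_mm - S_om.T @ beta
    return beta, cov

def impute_row(x, mu, Sigma):
    """Eq. (3) without the residual: x_m = mu_m + (x_o - mu_o) beta."""
    obs = ~np.isnan(x)
    beta, _ = regression_params(Sigma, obs)
    out = x.copy()
    out[~obs] = mu[~obs] + (x[obs] - mu[obs]) @ beta
    return out

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
beta, cov = regression_params(Sigma, obs=[True, False])
x_filled = impute_row(np.array([2.0, np.nan]), mu, Sigma)
```

With a correlation of 0.8 between the two variables, an observed value of 2 yields an imputed value of 1.6, and the residual variance of the imputed variable shrinks from 1 to 0.36.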

After the MV have been imputed, the iteration proceeds to update \(\mu \) and \(\Sigma \) using Eq. (5):

$$\begin{aligned} \begin{aligned} \widehat{\mu }^{t+1} = \frac{1}{N}\sum _{i=1}^{N}x_{i:}\\ \widehat{\Sigma }^{t+1} = \frac{1}{\widetilde{n}}\sum _{i=1}^{N}\{ \widehat{S}_{i}^{t} - (\widehat{\mu }^{t+1})^{T}\widehat{\mu }^{t+1}\} \end{aligned} \end{aligned}$$
(5)

where the new estimator of the mean \(\widehat{\mu }^{t+1}\) is the sample mean of the imputed data, \(\widehat{S}_{i}^{t} \equiv E[x_{i:}^{T}x_{i:}|x_{io};\widehat{\mu }^{t},\widehat{\Sigma }^{t}]\) is the conditional expectation of the cross-products of the ith example, and \(\widetilde{n}\) is the normalization constant that indicates the degrees of freedom in the covariance matrix of the complete data. For example, if only one mean vector \(\mu \) is estimated, the number of degrees of freedom is \(\widetilde{n} = N-1\). Furthermore, Schneider proposes a ridge regression (EMreg) for the case where the matrix \(\widehat{\Sigma }\) is singular and the maximum likelihood estimate of the regression coefficient matrix \(\beta \) is not defined. A more thorough description of the method is presented in [34].
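The iteration described by Eqs. (3) to (5) can be sketched as a short loop. This is a simplified illustration and not Schneider's implementation [34]: it assumes NumPy, NaN-encoded missing values, at least one observed variable per example, and a small fixed ridge term added to the observed block of the covariance matrix, whereas [34] selects the regularization adaptively. The function name `em_impute` and its parameters are hypothetical.

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6, ridge=1e-6):
    """Regression-based EM imputation with a fixed ridge term (illustrative only)."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # initial mean imputation
    mu = filled.mean(axis=0)
    Sigma = np.cov(filled, rowvar=False)
    for _ in range(n_iter):
        prev = filled.copy()
        C = np.zeros((D, D))  # accumulated residual covariance of the imputed values
        for i in range(N):
            mis = missing[i]
            if not mis.any():
                continue
            obs = ~mis
            S_oo = Sigma[np.ix_(obs, obs)] + ridge * np.eye(obs.sum())
            S_om = Sigma[np.ix_(obs, mis)]
            beta = np.linalg.solve(S_oo, S_om)                          # Eq. (4)
            filled[i, mis] = mu[mis] + (filled[i, obs] - mu[obs]) @ beta  # Eq. (3)
            C[np.ix_(mis, mis)] += Sigma[np.ix_(mis, mis)] - S_om.T @ beta
        # Eq. (5): sample mean, and covariance corrected by the residual term C.
        mu = filled.mean(axis=0)
        centered = filled - mu
        Sigma = (centered.T @ centered + C) / (N - 1)
        if np.abs(filled - prev).max() < tol:
            break
    return filled, mu, Sigma

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, np.nan]])
X_filled, mu_hat, Sigma_hat = em_impute(X)
```

In this toy dataset the second variable is exactly twice the first, so the imputed entry moves from the initial column mean towards 10 as the estimates of the mean and covariance are refined.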

In the context of Supervised Learning, it is common to find datasets with MV. This phenomenon needs to be addressed in order to perform tasks such as predicting a continuous (regression) or categorical (classification) variable. In this scenario, Schneider’s EM algorithm works separately on training and testing data, which is not correct, as any algorithm that operates on a training set should create a model that is subsequently applied to the testing set. This methodology always holds true for any supervised algorithm and should not be exclusive to pre-processing algorithms such as dimensionality reduction and missing value imputation techniques.

In [41], it is emphasized that for dimensionality reduction techniques, it is necessary for these algorithms to have the capability to project future data into a lower-dimensional space. This includes scenarios where future data comes in blocks or individually. Algorithms that possess this ability do so because they have an “Out-of-Sample” extension, in other words, they create a model that allows making predictions for data that were not used during training. In [9], we developed an Out-of-Sample version (EMreg-oos) that can be seamlessly applied in a scenario where both the training set and the testing set have missing values. Once used on the training set, the algorithm creates a general model consisting of as many imputation models as there are missing value patterns in the training set. This algorithm operates in a multi-source scenario which implies that the missing value phenomenon occurs at the level of blocks of variables. While uncommon, it is possible for a new missing value pattern to appear in the test data. To address this, the aforementioned method constructs an imputation model corresponding to the new missing value pattern using the training data. With this data, the missing value pattern is synthetically created to subsequently train the algorithm and generate the new imputation model.
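The Out-of-Sample idea, one imputation model per missing value pattern fitted on training data and reused on test data, can be sketched as follows. This is not the EMreg-oos implementation of [9]: it assumes NumPy, an already imputed training matrix, and plain regression models as in Eq. (3). For a pattern unseen in training it derives the model from the stored training mean and covariance, which is a simplification of the synthetic-pattern procedure described above. The class name `PatternImputer` is hypothetical.

```python
import numpy as np

class PatternImputer:
    """Keeps one regression coefficient matrix per missing value pattern."""

    def fit(self, X_imputed, M_train):
        self.mu = X_imputed.mean(axis=0)
        self.Sigma = np.cov(X_imputed, rowvar=False)
        self.models = {}
        for row in np.asarray(M_train, dtype=bool):
            if row.any():
                self._model(tuple(row.tolist()))
        return self

    def _model(self, pattern):
        # Build the model on demand, so unseen test patterns are also covered.
        if pattern not in self.models:
            mis = np.array(pattern, dtype=bool)
            obs = ~mis
            self.models[pattern] = np.linalg.solve(
                self.Sigma[np.ix_(obs, obs)], self.Sigma[np.ix_(obs, mis)])
        return self.models[pattern]

    def transform_row(self, x):
        mis = np.isnan(x)
        if not mis.any():
            return x.copy()
        beta = self._model(tuple(mis.tolist()))
        out = x.copy()
        out[mis] = self.mu[mis] + (x[~mis] - self.mu[~mis]) @ beta
        return out

X_imp = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
M_train = np.array([[0, 0], [0, 0], [0, 0], [0, 1]])
imputer = PatternImputer().fit(X_imp, M_train)
n_models_after_fit = len(imputer.models)
seen = imputer.transform_row(np.array([5.0, np.nan]))     # pattern seen in training
unseen = imputer.transform_row(np.array([np.nan, 10.0]))  # new pattern at test time
```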

4 Proposal: Regularized EM with recommended neighbors

A longitudinal study [21] revealed that MV in ADNI do not follow the MCAR pattern; instead, they are related to other features in addition to cognitive function. Furthermore, the authors discovered distinct MV mechanisms among various biomarkers and clinical groups. Taking this into consideration, we present an updated version of the EMreg-oos algorithm that incorporates various additional aspects aimed at improving classification after the imputation process. EMreg-oos with recommended neighbors (henceforth EMreg-KNN) is based on the same assumption as the EMreg-oos version, namely, that the data comes from different sources: therefore, missing values are observed in blocks and follow the MAR pattern.

Fig. 2 The structure of the EMreg-KNN algorithm

The general structure of the algorithm can be divided into 5 blocks, which are shown in Fig. 2:

  • Block 1: The algorithm takes as input the data matrix with missing values X, the two stopping criteria: \(it_{max}\) and threshold, which reflect the maximum number of iterations of the algorithm and the minimum difference that the imputed data matrix should have from one iteration to another (\(|X^{it+1} - X^{it}|\)) respectively, and the hyperparameter K representing the number of neighbors. For each source s, a distance matrix \(MD_{s} \in \mathbb {R}^{N\times N}\) and a neighbor matrix \(MVe_{s} \in \mathbb {N}^{N\times K}\) are calculated. \(MD_{s}\) will be incomplete if source s has missing values. \(MVe_{s}\) indicates, for each observation, which observations are neighbors.

  • Block 2: Once \(MD_{s}\) and \(MVe_{s}\) are obtained, it is possible to generate the list of recommended neighbors. For each block of missing values belonging to source s that needs to be imputed, the other sources recommend their own neighbors to create the neighborhood of the block to be imputed. Figure 3 explains the procedure for an example with 3 sources and 4 observations: the first block of missing values to be imputed is observation 4 from source 1. Distance matrices \(MD_{s}\) are calculated, which will be incomplete as long as there are blocks of missing values (only in the first iteration). Neighborhoods are calculated based on the value of K, which in this case is \(K=2\). Each source \(s\ne 1\) recommends neighbors from its own source to create a neighborhood for the block being imputed. In this case, source 2 recommends examples 1 and 2, and source 3 recommends examples 2 and 3. Thus, the recommended neighborhood for the block to be imputed is composed of examples 2 and 3. Example 1, recommended by source 2, is not considered because it has missing values in source 1. Each neighborhood belongs to an example, which is associated with a class \(y_l\). Then, a metric \(\pi (i,s)\) is calculated for observation i from source s, measuring the proportion of the class of the example within its neighborhood. As a proportion, we have \(0 \le \pi (i,s) \le 1\), where 0 means that the class \(y_l\) is not present among the observations composing the neighborhood, and 1 means that all the observations in the neighborhood have the same class \(y_l\).

  • Block 3: Once the neighborhoods are constructed, they will be used for imputation. For each block of missing values, a data matrix will be constructed that includes all sources of the data to be imputed and their respective neighborhood. This aims to build an imputation model using only the neighborhood data. The imputation model is based on Eq. 3, where it is necessary to store the model parameters \(\mu \) and \(\beta \) to subsequently construct the Out-of-Sample model. This procedure is performed for each block of missing values in the dataset. There will be as many imputation models as there are blocks of missing values. The output of this block is the completely imputed dataset, transitioning the dataset from version it to \(it+1\).

  • Block 4: After all missing value blocks are imputed, the algorithm checks whether \(it < it_{\textrm{max}}\) and whether \(|X^{it+1} - X^{it}| > threshold\). If either of these conditions is false, the algorithm proceeds to block 5. If both conditions are true, the algorithm returns to block 1, where the input is \(X^{it+1}\). Starting from iteration \(it=2\), there are no missing value blocks; therefore, the matrices \(MD_s\) are complete. In the first iteration, \(MVe_{s}\) may be incomplete because K can be too large to find that number of neighbors for a particular observation; from the second iteration onwards, all K requested neighbors can be found. It is important to mention that neighborhoods change from iteration to iteration as the imputed values are updated. This continues until the maximum number of iterations \(it_{\textrm{max}}\) is reached or until the change in the dataset between iterations no longer exceeds the threshold.

  • Block 5: When the algorithm proceeds to block 5, a final imputation of all missing value blocks is performed. To do this, the neighborhoods with the highest \(\pi (i,s)\) are considered. With this, the final imputed version of the training data is obtained. The neighborhood matrix MVe is a result that could be valuable depending on the task to be performed subsequently. As an example, Manifold Learning techniques [41] such as Isomap [37] or Locally Linear Embedding (LLE) [30, 32] use neighborhoods of observations to reduce dimensionality. Neighbor-based methods such as KNN and clustering techniques such as DBSCAN [12] also use neighbor information. In general, Unsupervised Learning algorithms can commonly leverage this information. Finally, it is necessary to calculate the imputation model that will allow performing imputations on the testing set, i.e., the Out-of-Sample model. To impute each observation in the testing set, its missing value pattern must first be identified. For each missing value pattern v, there are M associated imputation models, one for each of the observations having that missing value pattern during training. These models are used to create an ensemble that allows imputing each example of the testing set. Fig. 4 illustrates this process for a missing value pattern v, where each model depends on its neighborhood, a centroid, and regression coefficients. The centroid of each model is a representative vector of the neighborhood built from the mean of each variable; that is, it is a vector of means. This vector, which we will call \(r_{m}^v\), is the representative vector of model m with missing value pattern v and is used to calculate the distance to the ith test vector \(x_{i}^{\textrm{test}}\) that needs to be imputed. The distance calculation is performed according to Eq. (6).

    $$\begin{aligned} d_{mi}^{v} = D(r_{m}^v, x_{i}^{\textrm{test}}) \end{aligned}$$
    (6)

    where \(D(\cdot )\) is a distance function applied only to the variables with observed data from the test vector and the centroid of model m with missing value pattern v. In this case, we use the Euclidean distance, but it could be any distance. Therefore, the imputation of the test data, considering that the pattern of MV v contains M imputation models, is given by Eq. (7):

    $$\begin{aligned} \begin{aligned} x_{i}^{test}=&\sum _{m=1}^{M}W_{mi}^{v}\cdot f_{m}^{v}(X,\beta )\\ W_{mi}^{v} =&\frac{(1-H_{mi}^{v})}{M-1}\\ H_{mi}^{v} =&\frac{d_{mi}^{v}}{\sum _{m=1}^{M}d_{mi}^{v}} \end{aligned} \end{aligned}$$
    (7)

    where \(W_{mi}^{v}\) is a weighting factor associated with the imputation model m with respect to test observation i, and \(f_{m}^{v}(\cdot )\) is the mth regression model, as in Eq. (3). Both are specific to the missing value pattern v. \(H_{mi}^{v}\) is a weighting factor which, like \(W_{mi}^{v}\), satisfies the condition \(\sum _{m=1}^{M}H_{mi}^{v} = 1\). The difference between the two is that \(H_{mi}^{v}\) weights more when the distance \(d_{mi}^{v}\) is greater, and \(W_{mi}^{v}\) weights more when the distance \(d_{mi}^{v}\) is smaller. In this manner, each regression model is weighted with a greater factor when the test data is closer to the representative vector \(r_{m}^v\). The denominator of \(W_{mi}^{v}\) is \(M-1\) because, since \(\sum _{m=1}^{M}H_{mi}^{v} = 1\), \(\sum _{m=1}^{M}(1-H_{mi}^{v}) = M - \sum _{m=1}^{M}H_{mi}^{v} = M - 1\).
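The ensemble of Eqs. (6) and (7) can be sketched as follows. The sketch assumes NumPy, the Euclidean distance, and that each stored model is a dictionary with a centroid `r`, a mean vector `mu`, and a coefficient matrix `beta`; these names and the dictionary layout are hypothetical. Eq. (7) is undefined when M = 1 or when all distances are zero, so the fallback to uniform weights in those cases is an assumption of this sketch and is not taken from the paper.

```python
import numpy as np

def ensemble_impute(x_test, models):
    """Impute x_test with the distance-weighted ensemble of one missing value pattern."""
    mis = np.isnan(x_test)
    obs = ~mis
    # Eq. (6): distance between each centroid and the test vector, observed variables only.
    d = np.array([np.linalg.norm(m["r"][obs] - x_test[obs]) for m in models])
    M = len(models)
    if M == 1 or d.sum() == 0:
        W = np.full(M, 1.0 / M)  # assumed fallback, not covered by Eq. (7)
    else:
        H = d / d.sum()          # larger for distant models
        W = (1.0 - H) / (M - 1)  # larger for close models; sums to 1
    # Each model predicts the missing block with the regression of Eq. (3).
    preds = np.array([m["mu"][mis] + (x_test[obs] - m["mu"][obs]) @ m["beta"]
                      for m in models])
    out = x_test.copy()
    out[mis] = W @ preds
    return out, W

zero = np.zeros(2)
models = [
    {"r": np.array([2.0, 0.0]), "mu": zero, "beta": np.array([[1.0]])},
    {"r": np.array([0.0, 0.0]), "mu": zero, "beta": np.array([[2.0]])},
    {"r": np.array([3.0, 0.0]), "mu": zero, "beta": np.array([[4.0]])},
]
x_filled, W = ensemble_impute(np.array([1.0, np.nan]), models)
```

In this example the distances are 1, 1, and 2, so H = (0.25, 0.25, 0.5) and W = (0.375, 0.375, 0.25); the farthest model contributes the least to the imputed value.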

Fig. 3 Procedure for calculating recommended neighbors. Example for 3 sources and 4 observations

Fig. 4 The Out-of-Sample version is performed through an ensemble of regression models that were constructed during the training phase

The training stage of algorithm EMreg-KNN is described in Algorithm 1.

Algorithm 1 EMreg-KNN algorithm: Training stage

5 Experiments

5.1 Dataset

The ADNI-1 study offers a database of multimodal data for 819 subjects, including 229 healthy control (HC) participants with normal cognition, 397 with mild cognitive impairment (MCI), and 193 with mild Alzheimer’s disease. Individuals with MCI are further categorized into two groups: those who remained in a stable condition (sMCI) and those who later progressed to Alzheimer’s disease (pMCI). In this work, we examine three initial ADNI-1 modalities: cerebrospinal fluid (CSF), magnetic resonance imaging (MRI), and positron emission tomography (PET). These modalities were preprocessed following the procedures in [16], resulting in the exclusion of 43 out of 819 subjects due to quality control issues. The CSF dataset includes three variables that assess the levels of certain proteins and amino acids critically associated with Alzheimer’s disease. The MRI dataset provides volumetric characteristics of 83 brain anatomical regions. The PET dataset, acquired with the FDG radiotracer, provides an assessment of average brain function, specifically the rate of cerebral glucose metabolism, within the same 83 anatomical regions. Consequently, each subject is represented by a total of 169 features (3 CSF, 83 MRI and 83 PET). Table 1 shows details of the data distribution.

Table 1 The first column displays the number of examples for each class, while the subsequent columns indicate the number of examples with MV for each modality

5.2 Experimental setup

Following [9, 36, 44] and as addressed in the literature, we consider four different classification sub-problems. Namely, distinguishing between (a) healthy subjects and subjects with Alzheimer’s disease, (b) healthy subjects and subjects in a Mild Cognitive Impairment pre-dementia state, (c) subjects having either a stable or progressive Mild Cognitive Impairment, and (d) healthy subjects, subjects with Alzheimer’s disease and subjects with Mild Cognitive Impairment. Table 2 shows the number of examples and subject categories employed in each of these experimental settings.

Table 2 Number of instances per class in each of the four sub-problems addressed in the experiments

The same experimental strategy, comprising the data preprocessing and model selection procedures described below, is applied to each of the four classification scenarios mentioned above.

5.2.1 Data preprocessing

In order to perform model selection and estimate the prediction error, the full dataset is randomly partitioned into train and test subsets. Subsequently, the features are scaled by applying Z-score normalization. That is, the values of each feature j in the training partition are transformed by subtracting its mean \(\mu _{j}^{train}\) and dividing by its standard deviation \(\sigma _{j}^{train}\). Given that all standardization should adhere to the Out-of-Sample strategy, each feature j in the test partition is scaled with the statistics of the training partition, as described in Eq. 8.

$$\begin{aligned} X_{\textrm{std},j}^{\textrm{test}} = \frac{X_{\textrm{raw},j}^{\textrm{test}} - \mu _{j}^{\textrm{train}}}{\sigma _{j}^{\textrm{train}}} \end{aligned}$$
(8)
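As an illustration of Eq. 8, the following Python sketch fits the scaling statistics on the training partition only and reuses them on the test partition. Several details are assumptions rather than statements from the paper: missing entries are encoded as `None` and are skipped when computing the statistics, the population standard deviation is used, and constant features are divided by 1 to avoid a division by zero. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(X_train):
    """Per-feature mean and standard deviation from observed training values."""
    mu, sigma = [], []
    for col in zip(*X_train):
        observed = [v for v in col if v is not None]   # skip missing entries
        mu.append(mean(observed))
        # Assumption: fall back to 1.0 when the feature is constant.
        sigma.append(pstdev(observed) or 1.0)
    return mu, sigma


def transform(X, mu, sigma):
    """Apply Eq. 8; missing entries stay missing for the later imputation."""
    return [
        [None if v is None else (v - m) / s for v, m, s in zip(row, mu, sigma)]
        for row in X
    ]


X_train = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
X_test = [[3.0, 40.0]]

mu, sigma = fit_scaler(X_train)           # statistics from train only
X_train_std = transform(X_train, mu, sigma)
X_test_std = transform(X_test, mu, sigma)  # test reuses the train statistics
```

Computing the statistics on the training partition alone keeps information from the test subjects out of the preprocessing, which is the point of the Out-of-Sample requirement.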

5.2.2 Model selection

Given that our final objective is a satisfactory classification and that there is no ground truth for the missing values in ADNI, model selection follows a classification-based criterion rather than the achieved imputation quality. Consequently, the classification accuracy is estimated over 50 experimental runs, each following the same procedure: the dataset is randomly split into stratified train/test partitions (\(75\%\)/\(25\%\), respectively); a 5-fold cross-validation estimate is computed over the train partition to select the best hyper-parameter values of the imputation and classification methods combined; and finally, the chosen imputation and classification model is applied to the test partition.
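The protocol above can be outlined in a short Python sketch. It is only a schematic under stated assumptions: `evaluate(fit_idx, eval_idx, params)` is a hypothetical user-supplied callable that trains the combined imputation and classification pipeline on `fit_idx` and returns its accuracy on `eval_idx`; the cross-validation folds are plain shuffled folds, since the text does not state whether they are stratified; and the seed handling is illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac, rng):
    """Split indices into train/test while preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def k_folds(indices, k, rng):
    """Return k (fit, validation) index pairs over shuffled indices."""
    idx = list(indices)
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    return [
        ([i for g in range(k) if g != f for i in folds[g]], folds[f])
        for f in range(k)
    ]


def run_protocol(labels, candidates, evaluate, n_runs=50, k=5, seed=0):
    """Repeat: 75/25 stratified split, k-fold selection, test evaluation."""
    rng = random.Random(seed)
    test_scores = []
    for _ in range(n_runs):
        train, test = stratified_split(labels, 0.25, rng)
        folds = k_folds(train, k, rng)   # same folds for every candidate

        def cv_score(params):
            return sum(evaluate(fit, val, params) for fit, val in folds) / k

        best = max(candidates, key=cv_score)
        # The selected configuration is refit on the whole train partition.
        test_scores.append(evaluate(train, test, best))
    return test_scores
```

The mean and standard deviation of the returned list would correspond to the kind of summary reported per method and classifier. Hyper-parameters are chosen using the train partition only, so the test partition plays no role in model selection.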

5.2.3 Performance benchmark

To evaluate the improvement in the classification task over ADNI achieved by EMreg-KNN, we compared it against five state-of-the-art techniques: two EM-based methods, namely regularized EM (EMreg) [34] and EMreg-oos [9], and three methods widely used in the imputation literature, namely MissForest [35], MICE [40] and SVDimpute [39]. To ensure a fair comparison among these imputation methods irrespective of the classification algorithm employed, several classifiers are used, namely K-NN, \(\nu \)-SVM, Random Forest and Feedforward Artificial Neural Networks.

In addition to the performance in classification over the imputed data, we include baseline models built for each sub-problem over a reduced dataset having all subjects with missing entries removed (no imputation is done).

5.3 Results and discussion

The results attained with all combinations of imputation and classification methods in each of the four above-mentioned sub-problems are shown in Tables 3, 4, 5 and 6. Reported metric values correspond to the average performance obtained over 50 runs, with the standard deviation in parentheses. Bold values in each section of the tables mark the imputation method that achieves the best results in conjunction with the corresponding classifier. Additionally, to verify whether the methods improve on the base EM algorithm from which our proposal is derived, the symbol * indicates that the difference between the original EMreg algorithm and the method achieving the best result is statistically significant. This contrast is computed by applying a Student's t-test for the difference of means with \(\alpha = 0.05\).

Table 3 AD/HC multi-modality classification accuracy (acc.), area under the curve (AUC), sensitivity (sens.), specificity (spec.), and F-measure (F) for different methods

In the AD vs HC scenario (Table 3), EMreg-KNN obtains the highest accuracy and F-measure with 3 out of the 4 classifiers; only with RF is EMreg-oos slightly better. Nevertheless, the highest specificity is attained by EMreg-KNN, which denotes a greater ability to identify healthy subjects. Regarding the standard deviation, our proposal has the lowest value in almost all cases, which suggests that our algorithm is more robust.

Table 4 MCI/HC multi-modality classification accuracy (acc.), area under the curve (AUC), sensitivity (sens.), specificity (spec.), and F-measure (F) for different methods

Results in the MCI vs HC scenario are shown in Table 4. EMreg-KNN obtains the best accuracy, sensitivity and F-measure with all classifiers. The difference is statistically significant for 3 out of the 4 classifiers. In spite of the relatively low sensitivity values, it is important to notice that the proposal shows the greatest ability to identify diseased subjects. For the AUC metric, our proposal is the best with 2 out of the 4 classifiers, and its difference from the best result is very small with the other 2. This scenario also has a degree of class imbalance, which makes the F-measure especially relevant; as mentioned before, our proposal achieves the best F-measure in all cases. In terms of standard deviation, our proposal is better only with RF and ANN.

Table 5 pMCI/sMCI multi-modality classification accuracy (acc.), area under the curve (AUC), sensitivity (sens.), specificity (spec.), and F-measure (F) for different methods

When trying to differentiate between the stable and progressive states of Mild Cognitive Impairment (Table 5), EMreg-KNN achieves the best accuracy and AUC with all classifiers. For AUC, the difference is statistically significant with SVM, RF and ANN. The sensitivity is higher for the "None" case with all classifiers. This is an unexpected result and the only one of its kind in our experiments. Something similar happens with the F-measure. Removing examples with MV evidently favors the calculation of these metrics. Despite this, as previously highlighted, the accuracy is better with our proposal, and in a scenario with such a small imbalance, this carries more significance. Regarding the standard deviation, the only clear statement is that the "None" case has the highest value in all instances. This diminishes the value of the good results that this case achieves in some metrics. There is no clear winner among the imputation methods in terms of stability or robustness.

Table 6 AD/MCI/HC multi-modality classification accuracy (acc.), area under the curve (AUC), sensitivity (sens.), and F-measure (F) for different methods

For the last sub-problem, we assess the performance in the classification of AD vs HC vs MCI. Table 6 shows that EMreg-KNN achieves the best results with all classifiers in all metrics. Regarding the standard deviation, it is observed that the proposal shows a sustained improvement compared to EMreg. Since this is the only non-binary classification setting, all reported scores are macro-averaged.
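For readers unfamiliar with macro-averaging, the sketch below computes a macro-averaged F-measure: the F-measure is obtained for each class in a one-vs-rest fashion and the per-class values are averaged with equal weight. This is the standard definition and is assumed here; the paper does not spell out its exact computation. The function name and the toy labels are illustrative only.

```python
def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class (one-vs-rest) F-measures."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)


# Toy labels, not taken from the experiments.
y_true = ["AD", "AD", "HC", "HC", "MCI", "MCI"]
y_pred = ["AD", "HC", "HC", "HC", "MCI", "AD"]
score = macro_f_measure(y_true, y_pred)  # per-class F: 0.5, 0.8, 2/3
```

Because every class contributes equally to the average, a method cannot obtain a high macro-averaged score by favoring the majority class.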

As a general first impression, it is observed that imputation methods improve classification results by considering all available data, despite the possibility of introducing some bias. Results with non-imputed data demonstrate a consistently lower performance and higher dispersion across all metrics.

In a few instances and for a few metrics, the SVDImpute method shows better results. However, it should be noted that this method does not create an imputation model, and its only option is to work with the complete data matrix to impute missing values.

We can observe that in almost all cases, algorithms based on the EM approach deliver better results. Among them, EMreg-oos and EMreg-KNN outperform their competitors even in a statistically significant manner. Among these last two methods, our proposal (EMreg-KNN) exhibits the best overall results, considering various metrics, classifiers, and subproblems. Moreover, in the most challenging subproblem shown in Table 6, our algorithm is the best in 100% of the cases.

Finally, analyzing the standard deviation, we can say that imputation methods allow for much more stable or robust classification results, which translates to a low standard deviation. This is observed in all cases. Among the results using imputation methods, our proposal delivers very robust outcomes. Although it may not always be the winner in this metric, it consistently ranks among the best with a very small difference from the best result obtained in each case.

6 Conclusions and future work

In this work, we have introduced EMreg-KNN, a novel data imputation algorithm capable of dealing with missing values originating in different sources of information. Although our algorithm is based on Schneider’s EMreg, EMreg-KNN presents several improvements: (1) it allows the creation of imputation models (Out-of-Sample version), which can be used in a supervised learning scenario, something the original EM algorithm and matrix factorization techniques cannot do; (2) the imputation for each block is based on a regression ensemble constructed from instance neighborhoods, enhancing classifier performance; (3) creating neighborhoods during training enables extending this algorithm for use in subsequent clustering processes with only a slight modification in the \(\pi \) metric (defined in Sec. 4, Block 5).

We can observe two limitations of our algorithm: (1) the training phase is computationally more expensive, and (2) the hyperparameter K, which defines the number of neighbors in the neighborhoods, must be tuned. Both imply a greater effort to achieve the best results.

We have observed how imputation techniques enable the use of extra information that would otherwise be discarded. This helps to enhance the differentiation between various diagnostic groups. By developing biomarkers using more evidence, there is potential for more precise diagnoses and prognoses for Alzheimer’s patients. Our findings revealed that training classifiers using imputed data outperforms creating a predictive model with a smaller set of subjects having complete records. This conclusion is reinforced by the enhancement of both performance metrics and classifier robustness across all imputation techniques. We compared our algorithm with previous versions of the EM algorithm and others in the literature. The results demonstrate that our proposal achieves better outcomes in practically all scenarios, often showing a statistically significant difference. We aimed to test four classifiers of different nature precisely to prove that the results were independent of the classification algorithms.

Our future work will focus on: (1) studying the computational complexity and convergence properties of the algorithm; and (2) incorporating mechanisms to handle heterogeneous data, such as distance functions other than the Euclidean distance for neighborhood calculations. The goal of both is to enable the use of other data sources from ADNI and of other datasets with similar characteristics.