Background

Next generation sequencing technologies have revolutionized high-throughput analysis of the transcriptome. However, zero values present an inherent problem when analyzing the expression matrices generated through these techniques. When transcripts are relatively high in some samples, but not in others of the same type, or when the dimensionality of the data is high, technical zeros are even more likely to happen. Distinguishing technical zeros from true biologic null expression is essential for correct data interpretation.

To highlight the importance of data imputation, in terms of retaining classification accuracy, we consider the example problem of classifying images of handwritten digits. Classifying images of handwritten digits is a well studied problem in machine learning, and the test accuracy exceeds \(99\%\) using the state-of-the-art models [1]. See Fig. 1a, where we have shown an example, synthetic image of a handwritten 1. The same image, but with some missing pixels, is shown in Fig. 1b. The locations of the missing pixels are selected at random, and uniformly. If we wish to classify the images with missing pixels, then it is ill-advised to perform no data imputation (i.e., imputing zeros), as the accuracy would suffer. See Fig. 1c, where we have shown the effect of imputing zeros on the AUC, classification accuracy (ACC) and \(F_1\) score with the percentage of missing pixels. We see a decrease in classification accuracy and \(F_1\) score when more than \(10\%\) of the pixels are missing, and the reduction in accuracy is more pronounced as the percentage of missing pixels increases.

Fig. 1
figure 1

a Example image of a handwritten number 1. the same image but with missing pixels c AUC, classification accuracy and \(F_1\) score with % of missing pixels

Several strategies have been described for data imputation in gene expression and miRNA expression analysis [2,3,4,5,6,7,8,9,10,11]. Two popular techniques are “VIPER” and “scImpute.” In [3], the authors introduce “VIPER”, which implements data imputation on gene expression data using a combination of lasso (or an elastic net), with a box constrained regression. That is, first a set of neighboring cells are found, which have related expression values to the missing cell, using lasso (or elastic nets). Then a box constrained regression is performed on the selected neighbors to fill in the missing gene expression values. The reason for using lasso as preprocessing, is that the quadratic programming code employed in [3] for box constrained regression does not scale well to large matrices, and thus lasso (or an elastic net) is used to select a subset of candidate nearest neighbors to reduce the array size before nonnegative regression. In [9], the authors introduce “scImpute”, which shares similarities in the intuition to VIPER. First, in the scImpute algorithm, the cells are clustered into K groups using spectral clustering [12]. Then, the missing cell expressions are reconstructed from their neighboring cells by a nonnegative constrained regression. That is, the missing values are imputed using nonnegative linear combinations (i.e., a linear combination with nonnegative coefficients) of their nearest neighbors, where the neighboring cells are determined by spectral clustering. We choose to focus on VIPER (specifically the lasso variant) and scImpute for comparison here, as they share the most similarities with the proposed method.

Here we present a novel, fast method for data imputation using constrained Conjugate Gradient Least Squares (CGLS) borrowing ideas from the imaging and inverse problems literature. As an example of a desired application for this work, we present miRNA expression analysis, with a particular focus on cancer prediction. As shown below, highly accurate cancer prediction is possible using simple classifiers (e.g., a softmax function is used here), on a wide variety of data sets, in the case when all (or a large fraction) of the miRNA expression values are known, and there is little to no missing data. It is not always possible to measure all expression values contained in the training set, for every patient, however. To combat this, we aim to impute the missing values using the known expressions, so that we can retain use of our accurate model fit to the full set of miRNA available in the training set. We propose to reconstruct the missing data via nonnegative constrained regression, but with the further constraint that the regression weights sum to 1. Such constraints ensure that the imputed values lie within the range of the training data, with the idea to prevent overfitting. We enforce the regression weights to sum to 1 as a hard constraint in our objective, so that the nonnegative least squares and weight normalization steps are carried out simultaneously. To solve our objective, we apply the nonnegative Conjugate Gradient Least Squares (CGLS) algorithm of [13], typically applied in inverse problems and image reconstruction. The CGLS code we apply does not suffer the scaling issues encountered in, e.g., VIPER, for large matrices, and can process efficiently large scale expression arrays. The algorithm we propose offers a fast, efficient, and accurate imputation without the need for preprocessing steps, e.g., as in VIPER and scImpute. Our method is also completely nonparametric, and thus requires no tuning of hyperparameters before imputation, in contrast to scImpute and VIPER which require that two hyperparameters be tuned. Such parameters may be selected, for example, by cross validation, as is suggested in [3]. However, cross validation is slow, particularly for large data, and is thus impractical for clinical applications. To demonstrate the technique, we test the performance on miRNA expression data publicly available in the literature, and give a comparison to VIPER and scImpute. Specifically, as a measure of performance, we focus on how effectively each method retains the classification accuracy with the percentage of missing data (as in the curves shown in Fig. 1c). The proposed method is shown to be orders of magnitude faster than VIPER and scImpute, with greater than or equal accuracy, for the examples of interest considered here in cancer prediction.

Results

The method proposed here will be denoted by Fast Linear Imputation (FLI), for the remainder of this paper. The FLI algorithm and the core objective functions are discussed in detail in the appendix, section “Description of FLI”. In this section, we present a comparison of FLI, and the methods VIPER [3] and scImpute [9] from the literature on publicly available miRNA expression data [14,15,16] and synthetic handwritten image data. The specific implementations of VIPER and scImpute used here are discussed in the appendix, section “Implementation of VIPER and scImpute”. FLI is also compared against (unconstrained) regression, mean, and zero imputation as baseline. The classification model, selection of hyperparameters, and classification metrics are detailed in the appendix.

Synthetic handwritten image results

In this section, we present our results on the synthetic handwritten image data discussed in the introduction. The handwritten image data is included as a visual example to show clearly how the imputation methods are performing and to give some intuition as to why some methods perform better than others. For more details on this data see section “Data sets”.

Fig. 2
figure 2

Handwritten image data results. ac AUC, ACC and \(F_1\) scores with percentage of missing dimensions. df mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

In Fig. 2a–c we present plots of the AUC, ACC and \(F_1\) scores with the percentage of missing dimensions, for each method. Figure 2d–f show the corresponding plots of the mean, standard deviation and maximum imputation errors, over all test patients. The plots in Fig. 2e, f are cropped on the vertical axis to better highlight the errors corresponding to the more competitive methods. This cuts off the end of error curve corresponding to scImpute, which spikes when \(\approx 95\%\) of the dimensions are missing. In Table 1, we present the average values over the curves in Fig 2a–f, as a measure of the average performance over all possible levels of missing data. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 1c.

Table 1 Handwritten image data results

In terms of retaining the classification accuracy, FLI, VIPER, and scImpute offer comparable performance. FLI and VIPER are joint best and offer mean AUC, ACC, and \(F_1\) scores exceeding \(99\%\). For the baseline methods, namely (unconstrained) regression, mean, and zero imputation, we see a reduction in the classification accuracy, and the reduction is more pronounced when \(>70\%\) of the dimensions are missing, as evidenced by the curves in Fig. 2a–c. In terms of the imputation error, FLI offers the most consistent imputation accuracy, when compared to VIPER and scImpute, in the sense that FLI offers the smallest standard deviation and maximum errors.

Fig. 3
figure 3

Example image reconstructions of the one image shown in the introduction, using all methods considered. The number of missing pixels is 550, which is \(550/784=70\%\) of all pixels. The ground truth is also shown for comparison

For an example image imputation, see Fig. 3. where we have shown image reconstructions of the handwritten one image discussed in the introduction (Fig. 1a). We see ghosting artifacts in the VIPER reconstruction, and a significant blurring effect in the scImpute reconstruction. The regression imputation appears overfit, and introduces severe artifacts. FLI offers the clearest and sharpest image, with relatively few artifacts. So, in some cases, there are artifacts introduced by the VIPER and scImpute reconstructions. While this is not enough to confuse the classifier (i.e., the classification accuracy is still retained), the imputation error is less consistent when compared to FLI. In particular, the average maximum imputation error offered by FLI, over all levels of missing dimensions, is \(17\%\) lower than the next best performing method, namely VIPER. See the third row of Table 1b. FLI is also orders of magnitude faster than VIPER and scImpute, as indicated by the imputation times of Table 1c.

Singapore results

Here we present our results on the miRNA expression data of Chan et. al. [16], collected from Singaporean patients. This data includes significant batch effects due to different measurement technologies. See section “Data sets” for more details.

See Fig. 4a–c for plots of the classification accuracy, and Fig. 4d–f for the imputation errors with the percentage of missing dimensions. See Table 2 for the mean values over the curves in Fig. 4a–f, and the mean, standard deviation, and maximum imputation times. In this example, FLI offers the best performance in terms of retaining the AUC, ACC and \(F_1\) score, on average, across all levels of missing dimensions. As evidenced by Fig. 4a, FLI offers the highest AUC over all levels of missing dimensions. We see a similar effect in the ACC and \(F_1\) score curves of Fig. 4b, c, although, in a minority of cases, scImpute slightly outperforms FLI. The retention of the classification accuracy is significantly reduced using regression, mean and zero imputation, when compared to FLI, VIPER, and scImpute.

Fig. 4
figure 4

Singapore data results. a-c AUC, ACC and \(F_1\) scores with percentage of missing dimensions. df mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

FLI offers the most optimal performance in terms of the mean, standard deviation, and maximum imputation error, across all levels of missing dimensions, when compared to VIPER and scImpute. A zero imputation offers the best maximum and standard deviation error over all methods. The mean error offered by zero imputation is significantly higher than that of FLI, scImpute and VIPER, however. We would expect the standard deviation of a zero imputation to be low, as there is not much variation among the imputations (i.e., many of the values are zeros). The maximum error curves of Fig. 4f indicate that, for some patients, the imputation error is high when using FLI, VIPER and scImpute, as simply imputing zeros offers less error. Such erroneous patients can be considered outliers, and do not greatly effect the overall classification accuracy, as evidenced by the plots of Fig. 4a–c.

Table 2 Singapore data results

The imputation time offered by FLI is orders of magnitude faster than VIPER and scImpute. For example, FLI is approximately three orders of magnitude faster, in terms of mean imputation time, when compared to VIPER, which was the next best performing method in terms of AUC, ACC and \(F_1\) score, after FLI. The imputation time offered by FLI is also more consistent when compared to VIPER, as evidenced by the \(t_{\sigma }\) scores. When compared to scImpute, FLI is approximately one order magnitude faster in terms of mean and maximum imputation time, and is more consistent with lower standard deviation. Regression, mean and zero imputation are the fastest methods, but at the cost of accuracy.

This example was included given the presence of significant batch effects, as discussed at the beginning of this section, and in more detail in section “Data sets”. This example provides evidence that FLI is most optimal (compared to similar methods such as VIPER and scImpute), in terms of accuracy and imputation time, when imputing data in the presence of batch effects.

Korea results

Here we present our results on the miRNA expression data of Lee et. al. [17], collected from Korean patients. For more details on this data see section “Data sets”, point (3).

Fig. 5
figure 5

Korea data results. ac AUC, ACC and \(F_1\) scores with percentage of missing dimensions. df Mean (\(\epsilon _{\mu }\)), standard deviation (\(\epsilon _{\sigma }\)) and maximum (\(\epsilon _{M}\)) imputation errors over all test patients, with percentage of missing dimensions. The method is given in the figure legend

Table 3 Korea data results

In Fig. 5a–c we present plots of the AUC, ACC and \(F_1\) scores with the percentage of missing dimensions, for each method. Figure 5d–f show the corresponding plots of the mean, standard deviation and maximum imputation errors. The plots in Fig. 5d, f are cropped to \(\epsilon =0.35\) on the vertical axis to better highlight the errors corresponding to the more competitive methods. Thus, parts of some of the error curves are missing in Fig. 5d, f. For example, in most cases (i.e., for most levels of missing dimensions considered), the zero imputation mean and maximum error exceeds 0.35 and thus why much of the light blue curves corresponding to zero imputation are missing in the plots. In Table 3a, b, we present the average values over the curves in Fig. 5a–f. The mean, standard deviation, and maximum imputation times, over all test patients, are given in Table 3c.

In this example, FLI, VIPER, and scImpute offer similar levels of performance in terms of retaining the classification accuracy, with FLI offering the best average AUC, and scImpute the best average ACC and \(F_1\) scores. A standard regression imputations is also effective in retaining the classification accuracy up to approximately \(85\%\) of dimensions missing, and is comparable to FLI, VIPER, and scImpute within this range. We see a sharp reduction in accuracy in the regression curves (i.e., the purple curves of Fig. 5a–c) when more than \(85\%\) of the dimensions are missing, however, and regression significantly underperforms FLI, scImpute, and VIPER at this limit. For mean and zero imputation, we see a more gradual reduction in accuracy, when compared to regression. The imputation errors offered by FLI, VIPER, scImpute, and mean imputation are comparable, and outperform regression and zero imputation. When compared to VIPER and scImpute, FLI offers an imputation time which is orders of magnitude faster, in terms of mean imputation time. The imputation time offered by FLI is also more consistent, with lower standard deviation and maximum, when compared to VIPER and scImpute. As was the case in the previous examples, regression, mean, and zero imputation are the fastest methods, but at the cost of accuracy.

Discussion

In this paper we introduced FLI, a fast, robust data imputation method based on constrained least squares. To illustrate the technique, we tested FLI on synthetic handwritten image and real miRNA expression data sets, and gave a comparison to two similar methods from the literature, namely VIPER [3] and scImpute [9]. We also compared against (unconstrained) regression, mean, and zero imputation as baseline. The results highlight the effectiveness of FLI in retaining the classification accuracy in cancer prediction applications using miRNA expression data, and in image classification. When compared to VIPER and scImpute, FLI was shown to offer greater than or equal imputation accuracy, with imputation speed orders of magnitude faster than scImpute and VIPER, in all examples considered. VIPER, scImpute, and FLI significantly outperformed regression, mean and zero imputation in terms of imputation accuracy, in all examples considered, but were slower given the greater computational complexity. For further validation of FLI on two more real miRNA expression data sets, see appendix 6.

In section “Singapore results”, we considered an example expression data set collected from Singaporean patients, which included significant batch effects. When batch effects were present, FLI was shown to outperform VIPER and scImpute in terms of retaining the classification accuracy, and imputation error. On the handwritten image and Korean data sets, considered in sections “Synthetic handwritten image results” and “Korea results”, such batch effects were not detected. In these examples, the imputation accuracy offered by FLI, VIPER, and scImpute was comparable. This study provides evidence that FLI offers optimal imputation accuracy, when compared to the methods of literature, on batch data. This is important, since batch effects are common in medical data [18] and thus an imputation which is effective in combating batch effects, without the need for a-prioiri batch correction steps, is desirable.

In all examples considered, FLI was shown to be orders of magnitude faster than VIPER and scImpute. FLI is also completely nonparametric, and thus more straightforward to implement, in contrast to VIPER and scImpute, which require the tuning of two hyperparameters. It is suggested in [3] to tune the lasso parameter used by VIPER via cross validation. The reason for using lasso as preprocessing in VIPER, is that the quadratic programming code employed in [3] for nonnegative regression does not scale well to large matrices, and thus lasso (or elastic nets) are used to select a subset of candidate nearest neighbors before nonnegative regression. A similar intuition is used in [9] in scImpute, whereby the training data is clustered into K groups using spectral clustering [12] before nonnegative regression. That is, the test samples are imputed using linear combinations of their nearest neighbors, where the neighbors are determined a-priori by spectral clustering. The algorithm we propose does not suffer such scaling issues for large matrices, and does not require any preprocessing steps before imputation. Our method is also completely nonparametric, and thus requires no tuning of hyperparameters before imputation, in contrast to scImpute and VIPER which require that two hyperparameters be tuned. Cross validation is slow, however, and resulted in long imputation times (in the order of minutes) when using VIPER. As noted by the authors in [3], the quadratic programming algorithm, used to implement box constrained regression, is slow, and thus why lasso preprocessing is proposed. The nonnegative least squares code of scImpute, applied in [9], also suffers efficiency issues. To combat this, the authors proposed to limit the number of regression weights a-priori using spectral clustering. FLI does not suffer such efficiency concerns, and requires no tuning of hyperparameters or preprocessing steps a-priori. FLI thus offers a faster and more straightforward imputation, when compared to VIPER and scImpute. This is important in applications where large numbers of samples need to be processed quickly (e.g., large gene expression arrays). In such applications, FLI offers the most practical imputation time, in comparison to VIPER and scImpute.

Conclusions and further work

The technique FLI proposed here offers accurate and fast imputation for miRNA expression data. In particular, the imputation offered by FLI was sufficiently accurate to retain the classification accuracy in cancer prediction and handwritten image recognition problems when a large proportion (up to \(85\%\)) of the dimensions were missing. Thus, FLI offers an effective means to classify samples with missing data, without the need for model retraining.

The application of focus here is miRNA expression analysis and cancer prediction. Since miRNA expression variables are highly correlated, as indicated by the plots in Fig. 6, FLI is ideal for miRNA expression data. FLI is not exclusive to miRNA however, and is generalizable to any data set with highly correlated variables. This is validated by the multivariate normal data results in section “Multivariate normal examples” in the appendix. The current iteration of FLI requires a full training set (i.e., with no missing data) for the imputation, which can be considered a limitation of FLI. In further work we aim to address this limitation and develop FLI for more general missing data problems, and test further the generalizability of FLI on a variety of expression technologies commonly applied in cancer prediction (e.g., single-cell RNA, mRNA, protein expression, metabolite expression).

In this study, we assumed the locations of the missing data points to be random and uniform. In practice, the distribution of drop out events may be nonuniform. For example, in miRNA sequencing, the lower limit of detection is related to sequencing depth, thus within the technical zero range there may be a broad range of true expression values. We hypothesize that such expressions will more frequently be drop outs, when compared to more significantly expressed miRNA. In further work, we aim to test the effectiveness of FLI, and the methods of the literature, in the case when the drop out distribution is nonuniform, once the distribution of drop out events is decided upon.