Pointed Subspace Approach to Incomplete Data

Incomplete data are often represented as vectors with filled missing attributes joined with flag vectors indicating missing components. In this paper, we generalize this approach and represent incomplete data as pointed affine subspaces. This allows us to perform various affine transformations of data, such as whitening or dimensionality reduction. Moreover, this representation preserves the information about which coordinates were missing. To use our representation in practical classification tasks, we embed such generalized missing data into a vector space and define a scalar product on the embedding space. Our representation is easy to implement and can be used together with typical kernel methods. The experiments show that applying an SVM classifier to the proposed subspace representation yields highly accurate results.


Introduction
Incomplete data analysis is an important part of data engineering and machine learning, since incomplete data appear in many practical problems. In medical diagnosis, a doctor may be unable to complete the patient examination due to the deterioration of health status or lack of patient's compliance (Burke et al., 1997); in object detection, the system has to recognize the shape from low resolution or corrupted images (Berg et al., 2005); in chemistry, the complete analysis of compounds requires high financial costs (Stahura and Bajorath, 2004). Consequently, the understanding and the appropriate representation of such data is of great practical importance.
A missing data point is typically viewed as a pair $(x, J_x)$, where $x \in \mathbb{R}^N$ is a vector with missing components $J_x \subset \{1, \ldots, N\}$. In the most straightforward approach, one can fill missing attributes with some statistic, e.g. the mean, taken from existing data. Although such a strategy can be partially justified when the features are missing at random, we lose the knowledge about which attributes were unknown¹. To preserve this information we usually add a flag indicating which components were missing. More precisely, we supply $x$ with a binary vector $\mathbf{1}_{J_x}$, in which 1 denotes an absent feature while 0 denotes a present one.
Summarizing, we perform the embedding $(x, J_x) \mapsto (x, \mathbf{1}_{J_x})$ of missing points into a vector space of extended complete data. This allows us to apply typical classification tools, like SVM, with the scalar product defined by
$$\langle (x, \mathbf{1}_{J_x}), (y, \mathbf{1}_{J_y}) \rangle = \langle x, y \rangle + \langle \mathbf{1}_{J_x}, \mathbf{1}_{J_y} \rangle. \tag{1}$$
¹ In medical data, a component is typically missing when the state of the patient is so bad that a given procedure cannot be performed. Consequently, the knowledge that a given component is missing can say a lot about the state of the patient.
In practical classification problems we usually perform various affine transformations of data, such as whitening or dimensionality reduction, before training a classifier. Moreover, we may know that the data satisfy some affine constraint. It is nontrivial to modify the flag vectors so that they remain consistent with such affine transformations. Thus, the main problem behind the paper can be stated as follows: how should the flag vectors indicating the missing components be transformed if we perform a linear (or affine) mapping of data?
In this contribution, we show that the answer can be given by viewing incomplete data as pointed affine subspaces, i.e. affine subspaces with a distinguished point called the basepoint. We first observe that a pair $(x, J_x)$ can be formally associated with a pointed affine subspace of $\mathbb{R}^N$: $x + \mathrm{span}(e_j)_{j \in J_x}$, where $(e_j)_{j=1}^N$ denotes the canonical basis of $\mathbb{R}^N$ and $x$ is the selected basepoint. In other words, this is the set of all points which coincide with the representative $x$ on the coordinates outside $J_x$. Consequently, by a generalized missing data point in $\mathbb{R}^N$ we understand a pointed affine subspace $S_x = x + V$ of $\mathbb{R}^N$, where $x \in \mathbb{R}^N$ is a basepoint and $V = S_x - x$ is a linear subspace. Although the basepoint can be selected using various imputation techniques, we propose to choose the most probable point of $S_x$, i.e. to project the dataset mean onto $S_x$ with respect to the Mahalanobis scalar product given by the covariance of the data.
Such a definition allows us to efficiently extend linear and affine operations from standard points to missing ones, by taking the image of the subspace and of the basepoint. For example, an affine mapping $F : w \mapsto Aw + b$ can be extended to a pointed subspace by $F(x + V) = (Ax + b) + AV$. A pointed subspace $x + V$ can likewise be restricted to a given affine constraint $W$. Another question then arises: how to work with such data, and in particular how to embed generalized missing data into a vector space in a way that respects the scalar product (1) given by the flag embedding? Our main observation is that this can be achieved by identifying a linear subspace $V$ with the orthogonal projection $p_V : \mathbb{R}^N \to V$, i.e. by considering the embedding $(x, V) \mapsto (x, p_V) \in \mathbb{R}^N \times \mathbb{R}^{N \times N}$. We show that the scalar product of such embeddings coincides with (1), i.e. $\langle (x, p_V), (y, p_W) \rangle = \langle x, y \rangle + \langle \mathbf{1}_{J_x}, \mathbf{1}_{J_y} \rangle$ whenever $V = \mathrm{span}(e_j)_{j \in J_x}$ and $W = \mathrm{span}(e_j)_{j \in J_y}$.
The paper is organized as follows. The next section covers the related approaches to incomplete data analysis. In the third section, we define the generalized missing data, present a strategy of embedding such data into a vector space and propose a new imputation method. We also define a scalar product for such embeddings and show its connections with the existing flag approach. In the fourth section, we illustrate our method with sample classification results.

Related works
The most common approach to learning from incomplete data is known as deterministic imputation (McKnight et al., 2007). In this two-step procedure, the missing features are filled first, and only then a standard classifier is applied to the complete data (Little and Rubin, 2014). Although the imputation-based techniques are easy to use for practitioners, they lose the information about which features were missing and do not take into account the reasons for missingness. To preserve the information about missing attributes, one can use an additional vector of binary flags, which was discussed in the introduction.
The second popular group of methods aims at building a probabilistic model of incomplete data which maximizes the likelihood by applying the EM algorithm (Ghahramani and Jordan, 1994; Schafer, 1997). This allows one to generate the most probable values for missing attributes from the obtained probability distribution (random imputation) (McKnight et al., 2007) or to learn a decision function directly based on the distributional model. The second option was already investigated in the case of linear regression (Williams et al., 2005), kernel methods (Smola et al., 2005; Williams and Carin, 2005) or by using second order cone programming (Shivaswamy et al., 2006). One can also estimate the parameters of the probability model and the classifier jointly, which was considered in (Dick et al., 2008; Liao et al., 2007). These techniques work very well when the missing data is conditionally independent of the unobserved features given the observations, but there is no guarantee of a reasonable estimation in the more general missing not at random case.
There is also a group of methods which do not make any assumptions about the missing data model and make predictions from incomplete data directly. In (Chechik et al., 2008) a modified SVM classifier is trained by scaling the margin according to observed features only. Alternative approaches to learning a linear classifier, which avoid feature deletion or imputation, are presented in (Dekel et al., 2010; Globerson and Roweis, 2006). Finally, in (Grangier and Melvin, 2010) the embedding mapping of feature-value pairs is constructed together with a classification objective function.
In our contribution, we generalize the imputation-based techniques in such a way as to preserve the information about missing features. To select a basepoint, we propose to choose the most probable point from the subspace identifying a missing data point; however, other imputation methods can be used as well. The constructed representation allows one to apply various affine data transformations, while preserving the classical scalar product, before applying typical classification methods.

Generalized incomplete data
In this section, we introduce the subspace approach to incomplete data. First, we define a generalized missing data point, which allows us to perform affine transformations of incomplete data. Then, we show how to embed generalized missing data into a vector space and how to select a basepoint. Finally, we define a scalar product on the embedding space.

Incomplete data as pointed affine subspaces
Incomplete data $X$ can be understood as a sequence of pairs $(x_i, J_i)$, where $x_i \in \mathbb{R}^N$ and $J_i \subset \{1, \ldots, N\}$ indicates the missing coordinates of $x_i$. Therefore, we can associate a missing data point $(x, J)$ with an affine subspace $x + \mathrm{span}(e_j)_{j \in J}$, where $(e_j)_j$ is the canonical basis of $\mathbb{R}^N$. Let us observe that $x + \mathrm{span}(e_j)_{j \in J}$ is the set of all $N$-dimensional vectors which coincide with $x$ on the coordinates outside $J$.
In this paper, we focus on transforming incomplete data by affine mappings.For this purpose, we generalize the above representation to arbitrary affine subspaces, or more precisely pointed affine subspaces, which do not have to be generated by canonical bases.
Definition 1 A generalized missing data point is defined as a pointed affine subspace $S_x = x + V$, where $x \in \mathbb{R}^N$ is a basepoint and $V$ is a linear subspace of $\mathbb{R}^N$.
A basepoint can be selected by filling missing attributes using any imputation method, which will be discussed in the next subsection.
Remark 2 Observe that the notion of a pointed affine subspace differs from that of a classical affine subspace. In particular, a pointed subspace depends on the selection of the basepoint. Consequently, we can create two different generalized missing data points $S_y$, $S_z$ from the same missing data point $(x, J)$ by using different imputation methods.
First, we show that the above definition is useful for defining affine mappings on incomplete data. Let $S_x = x + V$ be a generalized missing data point and let $f : \mathbb{R}^N \ni w \mapsto Aw + b$ be an affine map. We can transform the generalized missing data point $x + V$ into another one by the formula:
$$f(x + V) = \{Aw + b : w \in x + V\}.$$
The basepoint $x$ is mapped into $Ax + b$, while the linear part of $f(x + V)$ is given by
$$f(x + V) - f(x) = \{Av : v \in V\} = AV.$$

Consequently, we arrive at the definition:
Definition 3 For a generalized missing data point $S_x = x + V$ and an affine mapping $f : w \mapsto Aw + b$ we put:
$$f(S_x) = (Ax + b) + AV,$$
where $Ax + b$ is a basepoint and $AV$ is a linear subspace.
One can easily compute and represent $AV$ if an orthonormal basis $v_1, \ldots, v_n$ of $V$ is given: we simply orthonormalize the sequence $Av_1, \ldots, Av_n$ (discarding linearly dependent vectors when $A$ is not injective on $V$).
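The orthonormalization step of Definition 3 can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `transform_pointed_subspace`, the tolerance `tol`, and the choice of an SVD for orthonormalization are assumptions made here, not part of the original method description.

```python
import numpy as np

def transform_pointed_subspace(x, V, A, b, tol=1e-10):
    """Apply w -> A w + b to the pointed subspace x + span(columns of V).

    x : (N,) basepoint, V : (N, n) matrix whose columns span the linear part.
    Returns the new basepoint A x + b and an orthonormal basis of A V.
    """
    new_x = A @ x + b
    AV = A @ V
    if AV.shape[1] == 0:              # complete point: nothing to orthonormalize
        return new_x, AV
    # SVD-based orthonormalization (an assumed choice); directions that
    # A collapses to zero are dropped.
    U, s, _ = np.linalg.svd(AV, full_matrices=False)
    rank = int(np.sum(s > tol * max(1.0, s.max())))
    return new_x, U[:, :rank]

# Hypothetical example: the second coordinate of x is missing.
x = np.array([1.0, 0.0, 2.0])
V = np.eye(3)[:, [1]]
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.5, -1.0])
new_x, new_V = transform_pointed_subspace(x, V, A, b)
```

In this example the image subspace is spanned by the first canonical vector of the target space, so the transformed point no longer corresponds to a missing coordinate of the original data.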

Embedding of generalized missing data
The above representation is useful for understanding and performing affine transformations of incomplete data, such as whitening, dimensionality reduction or incorporating affine constraints into data. Nevertheless, typical machine learning methods require vectors or a kind of kernel (or similarity) matrix as the input. We show how to embed generalized missing data into a vector space.
A generalized missing data point $S_x = x + V$ consists of a basepoint $x \in \mathbb{R}^N$, which is an element of a vector space, and a linear subspace $V$. To represent the subspace $V$, we propose to use the matrix of the orthogonal projection $p_V$ onto $V$. To get an explicit form of $p_V$, let us assume that $(v_j)_{j \in J}$ is an orthonormal basis of $V$. Then, the projection of $y \in \mathbb{R}^N$ can be calculated by
$$p_V(y) = \sum_{j \in J} \langle v_j, y \rangle v_j, \quad \text{i.e.} \quad p_V = \sum_{j \in J} v_j v_j^T.$$
The selection of a basepoint relies on filling missing attributes with some concrete values, which is commonly known as imputation. In our setting, by an imputation we denote a function $\Phi$ which assigns a basepoint $\Phi(S_x) \in S_x$ to a generalized missing data point $S_x$.
In the case of classical incomplete data, missing attributes are often filled with a mean or a median calculated from existing values of a given attribute. However, these imputations cannot be easily defined in the general case, because the linear part of a generalized missing data point might be an arbitrary linear subspace (not necessarily a subspace generated by a subset of the canonical basis). Let us observe that another popular imputation method, which fills the missing coordinates with zeros, can be defined for generalized incomplete data. This is performed by selecting the basepoint of an incomplete data point $S_x = x + V$ as the orthogonal projection of $x$ onto the subspace orthogonal to $V$, i.e.:
$$x - p_V(x) = x - \sum_{j \in J} \langle x, v_j \rangle v_j,$$
where $(v_j)_{j \in J}$ is an arbitrary orthonormal basis of $V$. If $V$ is spanned by vectors of the canonical basis, then this is equivalent to filling missing attributes with zeros.
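A minimal NumPy sketch of the projection matrix and of generalized zero imputation is given below. The helper names and the example vectors are hypothetical; the columns of `V` are assumed to be orthonormal.

```python
import numpy as np

def projection_matrix(V):
    """Matrix of the orthogonal projection onto span(V); V has orthonormal columns."""
    return V @ V.T

def zero_imputation_basepoint(x, V):
    """Project x onto the orthogonal complement of span(V)."""
    return x - V @ (V.T @ x)

# Canonical case: the second attribute is missing (7.0 is an arbitrary filler).
x = np.array([3.0, 7.0, -2.0])
V = np.eye(3)[:, [1]]
bp_canonical = zero_imputation_basepoint(x, V)

# General case: V is not spanned by canonical vectors.
x2 = np.array([2.0, 0.0, 5.0])
V2 = (np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)).reshape(3, 1)
bp_general = zero_imputation_basepoint(x2, V2)
```

In the canonical case the filler value is replaced by zero, while in the general case the basepoint is simply the element of the subspace that is orthogonal to `V2`.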
We propose another technique for setting missing values, which extends the zero imputation method. Let us assume that $(m, \Sigma)$ are the mean and covariance matrix estimated for the incomplete dataset $X$. In this method, the basepoint of $x + V$ is selected as the orthogonal projection of $m$ onto $x + V$ with respect to the Mahalanobis scalar product parametrized by $\Sigma$, i.e.
$$x + p_V^\Sigma(m - x),$$
where $p_V^\Sigma$ denotes the projection matrix onto $V$ with respect to the Mahalanobis scalar product given by $\Sigma$. For $m = 0$ and $\Sigma$ equal to the identity matrix, this reduces to zero imputation. To obtain the values of $m$ and $\Sigma$ in practice, one can use the existing attributes of incomplete data to calculate a sample mean and a covariance matrix. Alternatively, if the data satisfy the missing at random assumption, then the EM algorithm can be applied to estimate the probability model describing the data (Schafer, 1997). We call this technique the most probable point imputation.
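The most probable point imputation can be sketched as below. The closed form used for the Mahalanobis projection, $p_V^\Sigma = B (B^T \Sigma^{-1} B)^{-1} B^T \Sigma^{-1}$ for a basis matrix $B$ of $V$, is the standard projection for the scalar product $\langle u, w \rangle_\Sigma = u^T \Sigma^{-1} w$ and is an assumption about how the projection would be computed; the function names and the numbers are hypothetical.

```python
import numpy as np

def mahalanobis_projection(V, Sigma):
    """Projection onto span(V), orthogonal w.r.t. <u, w> = u^T Sigma^{-1} w."""
    Sinv_V = np.linalg.solve(Sigma, V)                  # Sigma^{-1} V
    return V @ np.linalg.solve(V.T @ Sinv_V, Sinv_V.T)  # V (V^T S^-1 V)^-1 V^T S^-1

def most_probable_basepoint(x, V, m, Sigma):
    """Project the mean m onto the affine subspace x + span(V)."""
    return x + mahalanobis_projection(V, Sigma) @ (m - x)

m = np.array([1.0, 1.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
x = np.array([3.0, 0.0])      # second attribute missing, 0.0 is a filler
V = np.eye(2)[:, [1]]
bp = most_probable_basepoint(x, V, m, Sigma)

# With m = 0 and Sigma = I the method reduces to zero imputation.
bp_zero = most_probable_basepoint(x, V, np.zeros(2), np.eye(2))
```

For a canonical subspace, as in this example, the filled value equals the conditional mean of the missing attribute given the observed one under a Gaussian model with parameters $(m, \Sigma)$.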
Summarizing, our embedding is defined as follows:
Definition 4 A generalized missing data point is embedded in a vector space by
$$S_x \mapsto (x, p_V) \in \mathbb{R}^N \times \mathbb{R}^{N \times N},$$
where $S_x = x + V$ and $x$ is a basepoint.
Example 1 To illustrate the effect of missing data imputation and transformation, let us consider the whitening operation:
$$\mathrm{Whitening}(x) = \Sigma^{-1/2}(x - m),$$
where $\Sigma$ is the covariance and $m$ the mean of $X$. For a generalized missing data point the above operation is defined by:
$$\mathrm{Whitening}(x + V) = \Sigma^{-1/2}(x - m) + \Sigma^{-1/2} V.$$
In other words, we map the basepoint in the classical way and transform the subspace $V$ into the linear subspace $\Sigma^{-1/2} V$. The illustration is given in Figure 2.
Example 2 In the case of high dimensional data, we sometimes reduce the dimension of the input data space by applying Principal Component Analysis, which is defined by:
$$\mathrm{PCA}(x) = W^T(x - m),$$
where $m$ is the mean of the dataset and the $k$ columns of $W$ are the leading eigenvectors of the covariance matrix $\Sigma$. This operation can be extended to the case of generalized missing data by:
$$\mathrm{PCA}(x + V) = W^T(x - m) + W^T V.$$
An example of the above operation is illustrated in Figure 3.
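As an illustration of Example 1, the sketch below whitens a pointed subspace in NumPy. Computing $\Sigma^{-1/2}$ through an eigendecomposition and orthonormalizing with a QR factorization are assumed implementation choices, and the numerical values are hypothetical.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(Sigma)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def whiten_pointed_subspace(x, V, m, Sigma):
    """Map x + span(V) to Sigma^{-1/2}(x - m) + Sigma^{-1/2} span(V)."""
    S = inv_sqrt(Sigma)
    # S is invertible, so S V keeps full column rank and QR yields a basis.
    Q, _ = np.linalg.qr(S @ V)
    return S @ (x - m), Q

m = np.array([1.0, 1.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
x = np.array([3.0, 0.0])
V = np.eye(2)[:, [1]]         # second attribute missing
new_x, new_V = whiten_pointed_subspace(x, V, m, Sigma)
```

Because the attributes are correlated here, the whitened subspace is no longer aligned with a coordinate axis, which is exactly the situation a binary flag vector cannot describe.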

Scalar product for SVM kernel
To apply most classification methods, it is necessary to define a scalar product (kernel matrix) on the data space. As a natural choice, one could sum the scalar products between basepoints and between embedding matrices, i.e.
$$\langle x + V, y + W \rangle = \langle x, y \rangle + \langle p_V, p_W \rangle. \tag{2}$$
However, $\|p_V\|^2 = \dim V$, which can be as large as the dimension $N$ of the data space, so the weight of the projection part can dominate the first part of (2) concerning basepoints. Consequently, we decided to introduce an additional parameter that allows reducing the importance of the projection part:
Definition 5 Let $D \in [0, 1]$ be fixed. As a scalar product between two generalized missing data points we put:
$$\langle x + V, y + W \rangle = \langle x, y \rangle + D \langle p_V, p_W \rangle. \tag{3}$$
Let us observe that the above parametric scalar product can be implemented by taking the embedding $x + V \mapsto (x, \sqrt{D}\, p_V)$ and then using formula (2) for the scalar product.
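The equivalence between the parametric scalar product and the scaled embedding can be checked with a short NumPy sketch. The function names and the example data are hypothetical, and the bases are assumed to have orthonormal columns.

```python
import numpy as np

def embed(x, V, D=1.0):
    """Flatten the pair (x, sqrt(D) * p_V) into a single vector."""
    P = V @ V.T
    return np.concatenate([x, np.sqrt(D) * P.ravel()])

def scalar_product(x, V, y, W, D=1.0):
    """<x, y> + D * <p_V, p_W>, with the Frobenius product of the projections."""
    return x @ y + D * np.sum((V @ V.T) * (W @ W.T))

I3 = np.eye(3)
x, V = np.array([1.0, 2.0, 3.0]), I3[:, [0]]
y, W = np.array([0.0, 1.0, 1.0]), I3[:, [0, 2]]
D = 0.25

direct = scalar_product(x, V, y, W, D)
via_embedding = embed(x, V, D) @ embed(y, W, D)
norm_sq_W = np.sum((W @ W.T) ** 2)   # squared norm of p_W, equal to dim W
```

The last line illustrates why the parameter $D$ is useful: the squared norm of a projection grows with the number of missing directions, independently of the scale of the basepoints.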
Remark 6 Observe that the value of function (3) strictly depends on the selection of basepoints, which means it is not a well defined scalar product on the space of classical affine subspaces. Indeed, $x + V$ defines the same affine subspace as $x + v + V$, where $v \in V$, but such shifts may lead to different values of the right hand side of (3). However, it is a well defined scalar product in the case of pointed affine subspaces, because two different selections of basepoints give different pointed affine subspaces (see Remark 2). Consequently, it can be safely used for the generalized missing data points considered in this paper.
The following proposition shows how to calculate a scalar product between matrices defining two orthogonal projections onto linear subspaces.
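Proposition 7 Let us consider subspaces
$$V = \mathrm{span}(v_j : j \in J), \quad W = \mathrm{span}(w_k : k \in K),$$
where $(v_j)_{j \in J}$ and $(w_k)_{k \in K}$ are orthonormal sequences. If $p_V, p_W$ denote the orthogonal projections onto $V, W$, respectively, then
$$\langle p_V, p_W \rangle = \sum_{j \in J} \sum_{k \in K} \langle v_j, w_k \rangle^2.$$
Proof By the definition of orthogonal projections and of the scalar product between matrices, we have
$$\langle p_V, p_W \rangle = \mathrm{tr}(p_V^T p_W) = \sum_{j \in J} \sum_{k \in K} \mathrm{tr}(v_j v_j^T w_k w_k^T).$$
Making use of $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, we get
$$\mathrm{tr}(v_j v_j^T w_k w_k^T) = \mathrm{tr}(w_k^T v_j v_j^T w_k) = \langle v_j, w_k \rangle^2.$$
Finally,
$$\langle p_V, p_W \rangle = \sum_{j \in J} \sum_{k \in K} \langle v_j, w_k \rangle^2.$$
Concluding, the scalar product between the embeddings of two generalized missing data points given by Definition 5 can be calculated as:
$$\langle x + V, y + W \rangle = \langle x, y \rangle + D \sum_{j \in J} \sum_{k \in K} \langle v_j, w_k \rangle^2, \tag{4}$$
where $(v_j)_{j \in J}$, $(w_k)_{k \in K}$ are orthonormal bases of $V, W$, respectively. The last expression can be more numerically efficient if the dimension of the subspaces (the number of missing attributes) is much smaller than the dimension of the whole space.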
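Proposition 7 can be verified numerically with the following sketch, which compares the trace form of the scalar product with the sum of squared inner products of the basis vectors. The random subspaces, the seed and the helper name are hypothetical.

```python
import numpy as np

def random_orthonormal_basis(N, n, rng):
    """Orthonormal basis (as columns) of a random n-dimensional subspace of R^N."""
    Q, _ = np.linalg.qr(rng.standard_normal((N, n)))
    return Q

rng = np.random.default_rng(0)
V = random_orthonormal_basis(6, 2, rng)
W = random_orthonormal_basis(6, 3, rng)

# Left-hand side: tr(p_V^T p_W), which builds two N x N matrices.
lhs = np.trace((V @ V.T).T @ (W @ W.T))
# Right-hand side: only the small n_V x n_W matrix of inner products is needed.
rhs = np.sum((V.T @ W) ** 2)
```

The right-hand side never forms an $N \times N$ matrix, which is the source of the efficiency gain mentioned above when few attributes are missing.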
Remark 8 One of the typical representations of missing data $(x, J)$ relies on filling unknown attributes and supplying the result with a binary flag vector $\mathbf{1}_J \in \mathbb{R}^N$, in which 1 denotes a coordinate belonging to $J$. This leads to the embedding of the missing data into a vector space given by
$$(x, J) \mapsto (x, \mathbf{1}_J) \in \mathbb{R}^N \times \mathbb{R}^N.$$
Then, the scalar product of such embeddings can be defined by
$$\langle (x, \mathbf{1}_J), (y, \mathbf{1}_K) \rangle = \langle x, y \rangle + \langle \mathbf{1}_J, \mathbf{1}_K \rangle. \tag{5}$$
It is worth noting that formula (5) coincides with the scalar product (2) defined for generalized missing data (i.e. with $D = 1$). Indeed, if $V = \mathrm{span}(e_j : j \in J)$ and $W = \mathrm{span}(e_k : k \in K)$, for $J, K \subset \{1, \ldots, N\}$, then by Proposition 7 we have
$$\langle p_V, p_W \rangle = \sum_{j \in J} \sum_{k \in K} \langle e_j, e_k \rangle^2 = \mathrm{card}(J \cap K) = \langle \mathbf{1}_J, \mathbf{1}_K \rangle,$$
so the right hand side of (2) is exactly the RHS of (5). Therefore, our approach generalizes and theoretically justifies the flag approach to missing data analysis. The importance of our construction lies in its generality, which in particular allows for performing typical affine transformations of data. In the case of the flag representation, there is no obvious way to perform such mappings on the flag vector.
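The coincidence stated in Remark 8 can be illustrated with a small NumPy sketch. Note that the code uses 0-based coordinate indices, and that the helper names and index sets are hypothetical.

```python
import numpy as np

def flag_vector(J, N):
    """Binary vector with ones on the (0-based) missing coordinates J."""
    return np.isin(np.arange(N), list(J)).astype(float)

def canonical_basis(J, N):
    """Columns e_j, j in J, spanning the subspace of missing coordinates."""
    return np.eye(N)[:, sorted(J)]

N = 5
J, K = {0, 3}, {3, 4}

flag_part = flag_vector(J, N) @ flag_vector(K, N)
projection_part = np.sum((canonical_basis(J, N).T @ canonical_basis(K, N)) ** 2)
common = len(J & K)   # number of coordinates missing in both points
```

Both quantities equal the number of coordinates missing in both points, so for canonical subspaces the flag kernel and the projection kernel agree.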
The above scenarios represent classical imputation and our pointed affine subspace approach. We would like to investigate how the information preserved in the subspace influences the classification results.
Finally, we calculated the scalar products (kernel matrices) for such representations of data and trained an SVM classifier implemented in libsvm (Chang and Lin, 2011). Missing features of test set instances were filled and transformed based on the training set only.
All experiments used double 5-fold cross validation. More precisely, for every division into train and test sets, the required hyperparameters were tuned using an inner 5-fold cross validation applied to the training set. The combination of parameters maximizing the mean accuracy score (on the validation set) was used to learn a final classifier on the entire training set, while the performance was evaluated on a testing set that was not used during training. The accuracy was averaged over all 5 trials. We tuned the standard margin parameter $C$ as well as the parameter $D$ in the formula of the scalar product for the subspace embedding. We performed a grid search over the following ranges: $C \in \{10^k : k = -2, -1, 0, 1\}$ and $D \in \{1/2^k : k = 0, 1, \ldots, 10\}$.
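The kernel matrix used by the classifier can be assembled as in the sketch below, which applies formula (4) to every pair of points. This is an illustrative sketch rather than the code used in the experiments: the function name `gram_matrix`, the representation of a point as a pair (basepoint, orthonormal basis), and the toy data are assumptions.

```python
import numpy as np

def gram_matrix(points_a, points_b, D):
    """Kernel matrix K[i, j] = <x_i, y_j> + D * sum_{j,k} <v_j, w_k>^2."""
    K = np.empty((len(points_a), len(points_b)))
    for i, (x, V) in enumerate(points_a):
        for j, (y, W) in enumerate(points_b):
            K[i, j] = x @ y + D * np.sum((V.T @ W) ** 2)
    return K

I3 = np.eye(3)
train = [
    (np.array([1.0, 0.0, 2.0]), I3[:, [1]]),        # second attribute missing
    (np.array([0.0, 0.0, 1.0]), I3[:, [0, 1]]),     # first two attributes missing
    (np.array([2.0, 1.0, 1.0]), np.zeros((3, 0))),  # complete point
]
K_train = gram_matrix(train, train, D=0.5)
```

A matrix of this form can be passed to any SVM implementation that accepts a precomputed kernel; for test data one would compute `gram_matrix(test, train, D)` with basepoints obtained from training-set statistics only.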

UCI datasets
We used three UCI datasets (for datasets with more than two classes we selected the two most numerous classes): breast cancer (BC), ionosphere (IS) and yeast (Y) (Asuncion and Newman, 2007). We considered two ways of removing attributes. In the first case, we randomly removed 90% of features. In the second case, we defined a structural process of attribute removal. More precisely, we drew $N$ points $x_1, \ldots, x_N$ from a dataset $X \subset \mathbb{R}^N$. Then, for every $x \in X$ we removed its $i$-th attribute with probability $\exp(-t \|x - x_i\|_\Sigma)$, where $\|x\|_\Sigma$ denotes the Mahalanobis norm of $x$ with respect to $\Sigma$ and $t > 0$ was chosen to remove approximately 90% of attributes.
The results presented in Table 1 show that there is no benefit from identifying absent attributes when the features were missing completely at random. One can observe that the most probable point imputation usually provided the highest accuracy among the imputation strategies.
In the case of structurally missing features (Table 2), the proposed subspace approach gave better classification results for all datasets and for all imputation methods. Moreover, the most probable point imputation outperformed other strategies of filling missing coordinates on two out of three datasets.
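The structural removal process can be sketched as follows. Several details are assumptions made for illustration: the reference points are drawn without replacement, $\Sigma$ is the sample covariance of the complete data, the Mahalanobis norm is taken as $\sqrt{d^T \Sigma^{-1} d}$, and the synthetic data and the value of `t` are hypothetical (in practice `t` would be tuned to reach the desired fraction of removed attributes).

```python
import numpy as np

def structural_missing_mask(X, t, rng):
    """Boolean mask (True = removed) with P(remove x[i]) = exp(-t * ||x - x_i||_Sigma)."""
    n, N = X.shape
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Assumption: one reference point per attribute, drawn without replacement.
    centers = X[rng.choice(n, size=N, replace=False)]
    mask = np.zeros((n, N), dtype=bool)
    for i in range(N):
        diff = X - centers[i]
        sq = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)
        dist = np.sqrt(np.maximum(sq, 0.0))   # Mahalanobis distance to x_i
        mask[:, i] = rng.random(n) < np.exp(-t * dist)
    return mask

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))             # synthetic stand-in for a dataset
mask = structural_missing_mask(X, t=0.5, rng=rng)
removed_fraction = mask.mean()
```

Points close to the reference point $x_i$ lose their $i$-th attribute with high probability, so the pattern of missingness carries information about the location of the point.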

Medical data
In this application we considered a real angiological dataset acquired from Jagiellonian Center of Experimental Therapeutic containing patients' examinations, http://jcet.eu/new_en/. The goal was to find patients with atherosclerosis. Innovative medical tests are very expensive and time-consuming, and in some cases they cannot be successfully completed due to the patient's condition. Consequently, the research database contains many empty cells, which is the effect of a purely structural process. Since some parameters are discrete while others are real-valued and presented in different scales, whitening the data is a natural preprocessing step.
The results illustrated in Table 3 partially confirm the hypothesis suggested by the previous experiment. Indeed, the use of the proposed subspace embedding gave higher accuracy for all imputation strategies, but the benefit from its application was not significant. It is difficult to decide which imputation strategy was optimal because all of them provided comparable results.

Conclusion
The paper generalized the existing approach of identifying missing attributes with binary flags. To enable appropriate affine transformations of data, we represented incomplete data as pointed affine subspaces and embedded them into a vector space by linking a pointed subspace with a basepoint joined with a corresponding projection matrix. In the same spirit we proposed to select a basepoint as the most probable point from a subspace, which extends the well-known zero imputation strategy. Such a combination provided the best performance in the conducted classification experiments in most cases.

Figure 3: The image 3(a) with two missing pixels and its projection onto two principal components 3(b). The image was represented by feature vectors consisting of 8x8 blocks. Missing pixels are identified by pointed subspaces with basepoints chosen by the zero imputation strategy.

Table 2: Mean accuracies for the classification of UCI datasets with structural attribute absence.

Table 3: Mean accuracies for the classification of medical data.
Florence L. Stahura and Jurgen Bajorath. Virtual screening methods that complement HTS. Combinatorial Chemistry & High Throughput Screening, 7(4):259-269, 2004.
David Williams and Lawrence Carin. Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the ICML Workshop on Learning With Multiple Views, 2005.
David Williams, Xuejun Liao, Ya Xue, and Lawrence Carin. Incomplete-data classification using logistic regression. In Proceedings of the International Conference on Machine Learning, pages 972-979. ACM, 2005.