
1 Introduction

Recent years have witnessed significant advances in multi-source learning. Many multi-source techniques have been developed to help extract useful information from rapidly growing volumes of multi-source data [25, 26]. However, unlike previous work that focused on mining general patterns from different sources, it is worth noting that no existing efforts have addressed the detection of multi-source heterogeneous outliers, i.e., the identification of abnormal heterogeneous observations across different sources.

Furthermore, detecting outliers is often more interesting and useful than identifying normal instances [4, 11, 17]. For example, to protect the property of customers, an electronic commerce detection system can monitor customers’ financial activities and flag abnormal credit card spending behavior as potential criminal activity (an outlier). Consequently, multi-source outlier detection is an important task in data mining, with many practical applications ranging from fraud detection to public health. Many outlier detection methods have been proposed over the past decades, including mono-source [6, 12, 13, 18] and multi-view [10, 14, 21, 27] approaches.

Fig. 1. Multi-source heterogeneous outlier

Mono-source Outlier Detection. Recently, researchers have investigated many machine learning methods [6, 12, 13, 18] for outlier detection in mono-source data. Knorr et al. [12] pointed out that identifying outliers can lead to the discovery of truly unexpected knowledge in various application fields, and proposed and analyzed several algorithms for finding DB-outliers. In [13], Li et al. presented a representation-based method that calculates the reverse unreachability of a point to evaluate to what degree an observation is a boundary point or an outlier. Rahmani and Atia [18] proposed two randomized algorithms, based on robust principal component analysis, for two distinct outlier models, namely the sparse and the independent outlier models. k-Nearest Neighbors (kNN) [6] uses the distance from a given data point to its kth nearest neighbor as the outlier score: the greater the score, the more likely the point is an outlier.
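To make the kNN criterion concrete, here is a minimal numpy sketch that scores each point by the distance to its kth nearest neighbor; the function name and the brute-force pairwise computation are our own illustrative choices, not details of [6]:

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X (samples in rows) by the distance to its
    k-th nearest neighbor; higher scores mean more likely outliers."""
    diffs = X[:, None, :] - X[None, :, :]        # (n, n, d) pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # (n, n) Euclidean distances
    np.fill_diagonal(dists, np.inf)              # exclude each point's self-distance
    return np.sort(dists, axis=1)[:, k - 1]      # k-th smallest distance per row

# One far-away point receives the largest score.
X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])
print(np.argmax(knn_outlier_scores(X, k=5)))     # 50, the injected point
```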

However, with the emergence of more and more multi-source data in many real-world scenarios, the task of outlier detection becomes even more challenging, as traditional mono-source outlier detection techniques are no longer suitable for multi-source heterogeneous data. For example, as shown in Fig. 1, it is very difficult to discover the outlier \(x_4\), i.e., White tiger, hidden in the cluster of Siberian tiger using the voice source alone. However, according to the distribution of the corresponding image source, the heterogeneous outlier \(y_4\) can be easily identified. Thus, it is necessary to develop an effective outlier detection method for multi-source heterogeneous data.

Multi-view Outlier Detection. To overcome the drawbacks of traditional mono-source methods, several efforts [10, 14, 21, 27] have been devoted to identifying multi-source outliers. In [21], Li et al. proposed a multi-view outlier detection framework to detect two different types of outliers from multiple views simultaneously. A characteristic of this method is that it can only detect outliers from different views in their own original low-level feature spaces. Similarly, Zhao and Fu [27] also investigated the multi-view outlier detection problem, detecting two different kinds of anomalies simultaneously. Unlike Li’s approach, they represented the two kinds of outliers in two different spaces, i.e., a latent space and the original feature space. Janeja and Palanisamy [10] presented a two-step method to find anomalous points across multiple domains: it first conducts single-domain anomaly detection to discover outliers in each domain, then mines association rules across domains to discover relationships between the anomalous points. A multi-view anomaly detection algorithm was developed by Liu and Lam [14] to find potentially malicious insider activity across multiple data sources.

Difficulties and Challenges. Generally, these existing methods identify multi-source outliers from different feature spaces by using association analysis across sources. Most of them are designed to detect Class-outliers [21, 27] in the original attribute spaces and to discover Attribute-outliers [21, 27] in combinations of underlying spaces. Consequently, these methods face an enormous challenge in real-world applications: because they attempt to identify multi-source outliers in different original feature spaces, it is extremely difficult for them to capture the complementary information across sources, which leads to a low recognition rate for multi-source outliers. Furthermore, it has been shown in [26] that consistent representations for multi-source heterogeneous data are more favorable for fully exploiting the complementarity among different sources. Thus, detecting all kinds of multi-source outliers from different sources in a consistent feature-homogeneous space is an urgent problem.

1.1 Main Contributions

The key contributions of this paper are highlighted as follows:

  • To detect multi-source heterogeneous outliers, a general Multi-Source Manifold Outlier Detection (MMOD) framework based on consistent representations for multi-source heterogeneous data is proposed.

  • Manifold learning is integrated into the framework to obtain a shared-representation space, in which information-correlated representations are close along the manifold while semantic-complementary instances are close in Euclidean distance.

  • According to the information compatibility among different sources, an affine subspace is learned through affine combination of shared representations from different sources in the feature-homogeneous space.

  • Multi-source heterogeneous outliers can be effectively identified in the affine subspace under the constraints of information compatibility among different sources.

1.2 Organization

The remainder of this paper is organized as follows: In Sect. 1.3, the notations are formally defined. We present a general framework for detecting multi-source heterogeneous outliers in Sect. 2.1. Furthermore, Sect. 2.2 provides an efficient algorithm to solve the proposed framework. Experimental results and analyses are reported in Sect. 3. Section 4 concludes this paper.

1.3 Notations

In this section, some important notations are summarized in Table 1 for convenience.

Table 1. Notations

2 Detecting Multi-source Heterogeneous Outliers

Here we propose a general framework to detect heterogeneous outliers in multi-source datasets.

2.1 The Proposed MMOD Model

In light of the shortcomings of existing multi-source methods, we focus on the particularly important problem of identifying all kinds of multi-source heterogeneous outliers in a consistent feature-homogeneous space.

In general, outliers are located around the margin of high-density regions of the data set, such as clusters. Furthermore, Elhamifar [7] has pointed out that each data point in a union of subspaces can be efficiently represented as a linear or affine combination of other points in the dataset. Meanwhile, Bertsekas has proved that all convex combinations lie geometrically within the convex hull of the given points [2]. Consequently, negative components in a representation correspond to outliers lying outside the convex combination of their neighbors [5, 20].

Fig. 2. 4 types of heterogeneous outliers.

Following the above-mentioned theoretical results, we propose a novel Multi-Source Manifold Outlier Detection (MMOD) framework based on consistent representations for multi-source heterogeneous data. The main goal of the proposed framework is the unified detection of outliers from heterogeneous datasets in a feature-homogeneous space, so as to avoid wasting the complementary information among different sources and to improve the recognition rate for multi-source outliers.

In this paper, we focus on detecting all kinds of heterogeneous outliers (see Fig. 2) from multi-source heterogeneous data. In particular, we aim to identify four types of heterogeneous outliers, defined below.

Definition 1

Type-A outliers have consistent abnormal behaviors in each source.

Definition 2

Type-B outliers are deviant instances that show normal clustering results in one source but abnormal cluster memberships in another source.

Definition 3

Type-C outliers own abnormal clustering results in each source.

Definition 4

Type-D outliers refer to exceptional samples that exhibit normal clustering results in one source but abnormal behavior in another source.

Given a normal dataset \(X_N=\{x_1,x_2,\cdots ,x_{n_1}\}\in \mathbb {R}^{d_x \times n_1}\) and a sample \(c\in \mathbb {R}^{d_x}\), an affine space \(H=\{w\in \mathbb {R}^{n_1}|X_Nw=c\}\) can be spanned by its neighbors from Source X. Note that w is the representation of c in the affine space, and \(w_i\) is the i-th component of the representation w of c.

It is known that if \(0 \le w_i \le 1\) for all i, then the point c lies within (or on the boundary of) the convex hull. If any \(w_i\) is less than zero or greater than one, then the point lies outside the convex hull. Thus, the data representation can uncover the intrinsic data structure. Heterogeneous outliers can therefore be identified according to the following principle, illustrated by the numerical sketch after the list.

  1. c is a normal point if \(0 \le w_i \le 1\) for all i.

  2. c is an abnormal point (outlier) if any \(w_i<0\) or \(w_i>1\).
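As a concrete illustration of this principle, the following numpy sketch solves \(X_Nw=c\) together with the affine constraint \(\sum _i w_i=1\) by least squares and then checks the components of w; the least-squares (minimum-norm) choice of w is our own simplification, since in the model the representation is learned jointly:

```python
import numpy as np

def affine_representation(X_N, c):
    """Solve X_N w = c subject to sum(w) = 1 (an affine combination).
    Least squares returns the minimum-norm solution; this is an
    illustrative stand-in for the representation learned by the model."""
    d, n = X_N.shape
    A = np.vstack([X_N, np.ones((1, n))])  # append the affine constraint row
    b = np.append(c, 1.0)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def is_outlier(w, tol=1e-8):
    # c lies outside the convex hull iff some coefficient leaves [0, 1].
    return bool(np.any(w < -tol) or np.any(w > 1.0 + tol))

X_N = np.array([[0.0, 1.0, 0.0],           # columns are the normal points:
                [0.0, 0.0, 1.0]])          # a triangle in the plane
print(is_outlier(affine_representation(X_N, np.array([0.2, 0.2]))))  # False
print(is_outlier(affine_representation(X_N, np.array([2.0, 2.0]))))  # True
```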

Specifically, to learn a Mahalanobis distance [23], new distance metrics are defined as follows:

$$\begin{aligned} \begin{array}{l} \mathcal {D}_{M_X}(x_i,x_j)=(x_i-x_j)^{T}M_X(x_i-x_j),\\ \end{array} \end{aligned}$$
(1)
$$\begin{aligned} \begin{array}{l} \mathcal {D}_{M_Y}(y_i,y_j)=(y_i-y_j)^{T}M_Y(y_i-y_j),\\ \end{array} \end{aligned}$$
(2)

where \(M_X=A^TA\) and \(M_Y=B^TB\) are two positive semi-definite matrices. Thus, the linear transformations A and B can be applied to each pair of co-occurring heterogeneous representations \((x_i,y_i)\).
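As a quick sanity check on Eqs. (1)–(2), the sketch below evaluates the metric with \(M = A^TA\); here A is taken as a k × d matrix acting as \(x \mapsto Ax\), an orientation convention we assume for illustration:

```python
import numpy as np

def mahalanobis_sq(xi, xj, A):
    """(x_i - x_j)^T M (x_i - x_j) with M = A^T A, as in Eq. (1).
    Factoring M through A makes it positive semi-definite: the value
    equals ||A(x_i - x_j)||^2 >= 0."""
    d = xi - xj
    return d @ (A.T @ A) @ d

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))            # k = 2, d = 4 (illustrative sizes)
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
assert np.isclose(mahalanobis_sq(x1, x2, A), np.sum((A @ (x1 - x2)) ** 2))
```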

Then the proposed approach can be formulated as follows:

$$\begin{aligned} \varvec{\varPsi _1}: \begin{array}{l} \displaystyle \underset{A,B,M,W}{min} \parallel {X_NAMB^TY_N^T}\parallel _F^2 + \alpha \parallel {X_NA - Y_NB}\parallel _F^2 \\ \displaystyle \quad s{.}t{.} \quad \quad \!\!\! A^TA=I \quad and \quad B^TB=I \quad and \quad M \succeq 0 \\ \displaystyle \quad \quad \quad \quad \! X_SA=W^TY_NB \end{array} \end{aligned}$$
(3)

where \(A\in \mathbb {R}^{d_x \times k}\), \(B\in \mathbb {R}^{d_y \times k}\), k is the dimensionality of the feature-homogeneous subspace, and \(\alpha \) is a trade-off parameter. The first term of the objective function in the model \(\varPsi _1\) measures the smoothness between the linear transformations A and B so as to extract the information correlation among heterogeneous representations. The second term is introduced to capture the semantic complementarity among different sources. The orthogonal constraints \(A^TA = I\) and \(B^TB = I\) are added to the optimization to effectively remove correlations among different features within the same source, and the positive semi-definite constraint \(M\in \mathbb {S}^{k \times k}_{+}\) (i.e., \(M \succeq 0\)) ensures a well-defined pseudo-metric. To identify multi-source heterogeneous outliers, the affine hull constraint \(X_SA=W^TY_NB\) based on data representation is added to the model \(\varPsi _1\) to learn an affine subspace. The matrix \(W\in \mathbb {R}^{n_1 \times n_2}\) encodes the neighborhood relationships between points in the affine subspace, and \(w_i\in \mathbb {R}^{n_1}\) is the representation of \(x_S^i\in \mathbb {R}^{d_x}\) in the affine subspace.

Note that solving the problem \(\varPsi _1\) in Eq. (3) directly is challenging for two main reasons. First, it is difficult to find a solution that satisfies the affine hull constraint. Second, the orthogonal constraints are not smooth, which makes it even more difficult to compute the optimum. Thus, we use Lagrangian duality to augment the objective function with a weighted penalty on the affine hull constraint, obtaining the solvable problem \(\varPsi _2\) below:

$$\begin{aligned} \varvec{\varPsi }_{\varvec{2}}: \begin{array}{l} \displaystyle \underset{A,B,M,W}{min} \parallel {X_NAMB^TY_N^T}\parallel _F^2 + \alpha \parallel {X_NA - Y_NB}\parallel _F^2 + \beta \parallel {\!X_SA \!-\! W^TY_NB\!}\parallel _F^2 \\ \displaystyle \quad s{.}t{.} \quad \quad \!\!\! A^TA=I \quad and \quad B^TB=I \quad and \quad M \succeq 0 \end{array} \end{aligned}$$
(4)
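For reference, the following minimal numpy sketch evaluates the (unconstrained) objective of \(\varPsi _2\); it assumes the rows-are-samples convention \(X_N\in \mathbb {R}^{n_1\times d_x}\), \(Y_N\in \mathbb {R}^{n_1\times d_y}\), and \(X_S\in \mathbb {R}^{n_2\times d_x}\), which the matrix products in Eq. (4) require:

```python
import numpy as np

def mmod_objective(X_N, Y_N, X_S, A, B, M, W, alpha, beta):
    """Evaluate the three terms of Psi_2 in Eq. (4); the constraints
    A^T A = I, B^T B = I, M >= 0 are left to the solver (Sect. 2.2)."""
    smoothness = np.linalg.norm(X_N @ A @ M @ B.T @ Y_N.T, 'fro') ** 2
    complementarity = np.linalg.norm(X_N @ A - Y_N @ B, 'fro') ** 2
    affine_hull = np.linalg.norm(X_S @ A - W.T @ Y_N @ B, 'fro') ** 2
    return smoothness + alpha * complementarity + beta * affine_hull
```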

In Sect. 2.2, an efficient algorithm is proposed to solve the problem \(\varPsi _2\).

2.2 An Efficient Solver for \({\varvec{\varPsi }}_{\varvec{2}}\)

Here we provide an efficient algorithm to solve \(\varPsi _2\).

The optimization problem \(\varPsi _2\) in Eq. (4) can be simplified as follows:

$$\begin{aligned} \begin{array}{l} \displaystyle \underset{Z\in \mathcal {C}}{min} \quad F(Z) = \Vert X_NAMB^TY_N^T\Vert _F^2 + \alpha \Vert X_NA - Y_NB\Vert _F^2 + \beta \Vert X_SA - W^TY_NB\Vert _F^2,\\ \end{array} \end{aligned}$$
(5)

where \(F(\cdot )\) is a smooth objective function, \(Z\!=\![A_Z \quad B_Z \quad M_Z \quad W_Z]\) symbolically represents the optimization variables, and \(\mathcal {C}\) is the closed domain with respect to each variable:

$$\begin{aligned} \begin{array}{l} \mathcal {C}=\{Z|A^T_ZA_Z=I, B^T_ZB_Z=I, M_Z\succeq 0\}.\\ \end{array} \end{aligned}$$
(6)

Obviously, the optimization problem in Eq. (5) is non-convex. However, Ando and Zhang have shown in [1] that the alternating optimization method can effectively solve non-convex problems. They also pointed out that this usually does not lead to serious problems, since given the locally optimal solution of one variable, the solutions of the other variables remain globally optimal for their respective subproblems.

Additionally, the problem in Eq. (5) is separately convex with respect to each optimization variable, and \(F(\cdot )\) is continuously differentiable with a Lipschitz continuous gradient [15] with respect to each variable. Thus, by combining the Accelerated Projected Gradient (APG) method [15] with the alternating optimization approach [1], the problem in Eq. (5) can be solved effectively.
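As a small self-contained illustration of alternating optimization (not the MMOD updates themselves), the following sketch minimizes the bi-convex objective \(\Vert X-uv^T\Vert _F^2\): each subproblem is a convex least-squares problem even though the joint problem is non-convex:

```python
import numpy as np

# Toy bi-convex example of alternating optimization [1]: minimize
# ||X - u v^T||_F^2 over u and v. Each update below is the exact
# minimizer of its convex subproblem with the other variable fixed.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
u, v = rng.standard_normal(20), rng.standard_normal(10)
for _ in range(100):
    u = X @ v / (v @ v)                    # optimal u given v
    v = X.T @ u / (u @ u)                  # optimal v given u
print(np.linalg.norm(X - np.outer(u, v), 'fro'))  # ~ best rank-1 residual
```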

However, the non-convex optimization problem in Eq. (5) is generally difficult to solve because of the orthogonal constraints. Guo and Xiao have pointed out in [8] that the Gradient Descent Method with Curvilinear Search (GDMCS) in [24] can effectively solve such non-convex problems for a locally optimal solution, as long as the Armijo-Wolfe conditions are satisfied.

Furthermore, since the objective function in Eq. (5) is smooth, the gradients of the objective with respect to A and B can be easily computed. In each iteration of the gradient descent procedure, given the current feasible point (A, B), the gradients are computed as follows:

$$\begin{aligned} \begin{array}{l} G_1 = \triangledown _AF(A,B), \end{array} \end{aligned}$$
(7)
$$\begin{aligned} \begin{array}{l} G_2 = \triangledown _BF(A,B). \end{array} \end{aligned}$$
(8)

We then compute two skew-symmetric matrices:

$$\begin{aligned} \begin{array}{l} F_1 = G_1A^T - AG^T_1,\\ \end{array} \end{aligned}$$
(9)
$$\begin{aligned} \begin{array}{l} F_2 = G_2B^T - BG^T_2.\\ \end{array} \end{aligned}$$
(10)

It is easy to see that \(F_1^T=-F_1\) and \(F_2^T=-F_2\). The next point can then be searched for along a curvilinear path parameterized by a step size \(\tau \), such that

$$\begin{aligned} \begin{array}{l} Q_1(\tau ) = (I+\tau F_1/2)^{-1}(I-\tau F_1/2)A,\\ \end{array} \end{aligned}$$
(11)
$$\begin{aligned} \begin{array}{l} Q_2(\tau ) = (I+\tau F_2/2)^{-1}(I-\tau F_2/2)B.\\ \end{array} \end{aligned}$$
(12)
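The update in Eqs. (11)–(12) is a Cayley transform; a minimal numpy sketch (with illustrative sizes and a random gradient) shows that it keeps the iterate on the feasible set:

```python
import numpy as np

def cayley_step(A, G, tau):
    """Q(tau) = (I + tau*F/2)^{-1} (I - tau*F/2) A with F = G A^T - A G^T,
    as in Eqs. (9)-(12); A has orthonormal columns and G is the gradient."""
    F = G @ A.T - A @ G.T                  # skew-symmetric by construction
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + tau * F / 2, (I - tau * F / 2) @ A)

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # feasible start: A^T A = I
Q = cayley_step(A, rng.standard_normal((5, 2)), tau=0.3)
print(np.allclose(Q.T @ Q, np.eye(2)))             # True: feasibility preserved
```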

It is easy to verify that \(Q_1(\tau )^TQ_1(\tau )=I\) and \(Q_2(\tau )^TQ_2(\tau )=I\) for all \(\tau \in \mathbb {R}\), so we stay in the feasible region along the curve defined by \(\tau \). Moreover, \(dQ_1(0)/d\tau \) and \(dQ_2(0)/d\tau \) equal the projections of \((-G_1)\) and \((-G_2)\) onto the tangent space of \(\mathcal {C}\) at the current point (A, B). Hence \(\{Q_1(\tau ),Q_2(\tau )\}_{(\tau \ge 0)}\) is a descent path in the close neighborhood of the current point. We thus apply a strategy similar to standard backtracking line search to find a proper step size \(\tau \) by curvilinear search, while guaranteeing that the iterations converge to a stationary point. A proper step size \(\tau \) is one satisfying the following Armijo-Wolfe conditions [24]:

$$\begin{aligned} \begin{array}{l} F(Q_1(\tau ),Q_2(\tau )) \le F(Q_1(0),Q_2(0)) + \rho _1\tau F_\tau ^{'}(Q_1(0),Q_2(0)),\\ \end{array} \end{aligned}$$
(13)
$$\begin{aligned} \begin{array}{l} F_\tau ^{'}(Q_1(\tau ),Q_2(\tau )) \ge \rho _2F_\tau ^{'}(Q_1(0),Q_2(0)).\\ \end{array} \end{aligned}$$
(14)

Here \(F_\tau ^{'}(Q_1(\tau ),Q_2(\tau ))\) is the derivative of F with respect to \(\tau \); following the curvilinear-search formulation of [24] (the expression below is reconstructed from it and agrees with Eq. (16) at \(\tau =0\)), with the gradients evaluated at \((Q_1(\tau ),Q_2(\tau ))\),

$$\begin{aligned} F_\tau ^{'}(Q_1(\tau ),Q_2(\tau ))&= -tr\Big (\triangledown _AF(Q_1(\tau ),Q_2(\tau ))^T(I+\tau F_1/2)^{-1}F_1\,\frac{Q_1(\tau )+A}{2}\Big ) \nonumber \\&~~ -tr\Big (\triangledown _BF(Q_1(\tau ),Q_2(\tau ))^T(I+\tau F_2/2)^{-1}F_2\,\frac{Q_2(\tau )+B}{2}\Big ). \end{aligned}$$
(15)

Therefore,

$$\begin{aligned} F_\tau ^{'}(Q_1(0),Q_2(0))&= -tr(G_1^T(G_1A^T-AG_1^T)A) \nonumber \\&~~ -tr(G_2^T(G_2B^T-BG_2^T)B) \nonumber \\&= -\displaystyle \frac{\Vert F_1\Vert _F^2}{2}-\frac{\Vert F_2\Vert _F^2}{2}. \end{aligned}$$
(16)

Accordingly, it is appropriate to use the gradient descent method to solve the problem \(\varPsi _2\) in Eq. (5).
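A simplified backtracking sketch of this curvilinear search follows; for brevity it checks only the sufficient-decrease condition (13) (the curvature condition (14) would be tested the same way), and the step-size schedule is our own assumption:

```python
import numpy as np

def cayley(P, G, tau):
    """Curvilinear update of Eqs. (11)-(12) for a point P with gradient G."""
    F = G @ P.T - P @ G.T
    I = np.eye(P.shape[0])
    return np.linalg.solve(I + tau * F / 2, (I - tau * F / 2) @ P)

def curvilinear_search(F_obj, A, B, G1, G2, rho1=1e-4, tau=1.0,
                       shrink=0.5, max_iter=30):
    """Backtracking over the curvilinear path; F_obj evaluates Eq. (5)."""
    F1 = G1 @ A.T - A @ G1.T
    F2 = G2 @ B.T - B @ G2.T
    # Eq. (16): derivative of the objective along the path at tau = 0.
    dF0 = -(np.linalg.norm(F1, 'fro') ** 2 +
            np.linalg.norm(F2, 'fro') ** 2) / 2
    f0 = F_obj(A, B)
    for _ in range(max_iter):
        Q1, Q2 = cayley(A, G1, tau), cayley(B, G2, tau)
        if F_obj(Q1, Q2) <= f0 + rho1 * tau * dF0:   # Armijo condition (13)
            return Q1, Q2, tau
        tau *= shrink                                # shrink the step and retry
    return A, B, 0.0                                 # no acceptable step found
```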

The APG algorithm is a first-order gradient method that accelerates each gradient step on the feasible set to obtain an optimal solution when minimizing a smooth function [16]. The method constructs a solution-point sequence \(\{Z_i\}\) and a search-point sequence \(\{S_i\}\), where each \(Z_i\) is updated from \(S_i\).

Furthermore, a given point s in the APG algorithm needs to be projected onto the set \(\mathcal {C}\):

$$\begin{aligned} \begin{array}{l} proj_{\mathcal {C}}(s)=arg \quad \!\!\!\! \underset{z\in \mathcal {C}}{min}\Vert z-s\Vert _F^{2}/2.\\ \end{array} \end{aligned}$$
(17)
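A generic sketch of the APG iteration (with the projection step of Eq. (17)) is given below on a toy problem; the Lipschitz constant L and the momentum schedule follow the standard accelerated scheme and are our own assumptions:

```python
import numpy as np

def apg(grad, proj, z0, L, n_iter=100):
    """Accelerated projected gradient: each gradient step at the search
    point S_i is followed by a projection onto the feasible set C."""
    z_prev, z, t_prev, t = z0, z0, 1.0, 1.0
    for _ in range(n_iter):
        s = z + ((t_prev - 1.0) / t) * (z - z_prev)  # search point S_i
        z_prev, z = z, proj(s - grad(s) / L)         # solution point Z_i
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return z

# Toy instance: minimize ||z - c||^2 / 2 subject to z >= 0.
c = np.array([1.0, -2.0, 3.0])
z = apg(grad=lambda z: z - c, proj=lambda z: np.maximum(z, 0.0),
        z0=np.zeros(3), L=1.0)
print(z)                                             # [1. 0. 3.]
```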

Weinberger et al. proposed a Positive Semi-definite Projection (PSP) [23] to minimize a smooth function while maintaining positive semi-definite constraints. It projects the optimization variables onto the cone of positive semi-definite matrices after each gradient step. The projection is computed from the diagonalization of the variables, which effectively truncates any negative eigenvalues from the gradient step, setting them to zero. We can therefore use the PSP to solve the problem in Eq. (17).
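A minimal sketch of this projection (our own implementation of the eigenvalue truncation described above):

```python
import numpy as np

def psp(M):
    """Project a matrix onto the cone of positive semi-definite matrices:
    symmetrize, diagonalize, and truncate negative eigenvalues to zero."""
    M = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

M = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues 3 and -1
print(np.linalg.eigvalsh(psp(M)))          # [0. 3.]: negative part removed
```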

Finally, to solve the problem in Eq. (5), the projection \(Z=[A_Z \!\quad \! B_Z \!\quad \! M_Z \!\quad \! W_Z]\) of a given point \(S=[A_S \!\quad \! B_S \!\quad \! M_S \!\quad \! W_S]\) onto the set \(\mathcal {C}\) is defined by:

$$\begin{aligned} \begin{array}{l} proj_{\mathcal {C}}(S)=arg \quad \!\!\!\! \underset{Z\in \mathcal {C}}{min}\Vert Z-S\Vert _F^{2}/2.\\ \end{array} \end{aligned}$$
(18)

By combining APG, GDMCS, and PSP, we can solve the problem in Eq. (5). The overall algorithm is given in Algorithm 1, where the function \({{\varvec{Schmidt}}}(\cdot )\) denotes the Gram-Schmidt process.


3 Experimental Evaluation

Our experiments are conducted on three publicly available multi-source datasets, namely, UCI Multiple Features (UCI MFeat) [3], Wikipedia [19], and MIR Flickr [9]. The statistics of the datasets are given in Table 2, and brief descriptions of the chosen feature sets in the above-mentioned datasets are listed in Table 3.

Table 2. Statistics of the multi-source datasets
Table 3. Brief descriptions of the feature sets

Note that all the data are normalized to unit length. Each dataset is randomly split into a training set and a test set: the training samples account for 80% of each original dataset, and the remaining ones serve as test data. This partition is repeated five times and the average performance is reported. A set of 100 outliers is generated from Gaussian noise with the same dimension as the normal samples in each source and mixed into each dataset. Key parameters of all methods in our experiments are tuned by 5-fold cross-validation based on the AUC (area under the receiver operating characteristic curve) on the training set. The LIBSVM classifier serves as the benchmark for the classification tasks in the experiments.
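For reproducibility, a sketch of the outlier-injection step is given below; the noise scale (standard normal) and the seed are our assumptions, since the text specifies only the dimension and the unit-length normalization:

```python
import numpy as np

def inject_outliers(X, n_out=100, seed=0):
    """Append n_out Gaussian-noise outliers with the same dimension as the
    normal samples (rows of X), then normalize every row to unit length."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, rng.standard_normal((n_out, X.shape[1]))])
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit-length rows
    y = np.r_[np.zeros(len(X)), np.ones(n_out)]     # 1 marks injected outliers
    return Z, y
```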

3.1 Comparison of Multi-view Outlier Detection Methods

The purpose of comparing the proposed MMOD model with multi-view outlier detection methods, such as Li’s method [21], Zhao’s method [27], Janeja’s method [10], and Liu’s method [14], is to show the importance of identifying multi-source outliers in a consistent feature-homogeneous space. Because they attempt to identify multi-source outliers in different original feature spaces, it is extremely difficult for the compared approaches to capture the complementary information across sources, which leads to a low recognition rate for multi-source outliers. To validate this point, we compare the recognition rate of MMOD with the above-mentioned multi-view outlier detection methods. The parameter settings of the compared methods are the same as in their original papers.

Table 4. Comparison of multi-view outlier detection methods

The proposed MMOD model identifies multi-source outliers in a consistent feature-homogeneous space. As shown in Table 4, MMOD improves the recognition rate for multi-source outliers more effectively than Li’s method, Zhao’s method, Janeja’s method, and Liu’s method, which indicates that MMOD captures the information compatibility among different sources more effectively.

3.2 Comparison of Mono-source Outlier Detection Approaches

To evaluate outlier detection performance, we compare our method with representative state-of-the-art mono-source methods, namely Li’s method [13], Rahmani’s method [18], and kNN [6], on the three multi-source datasets. The basic metric, Precision (P), is used to evaluate each algorithm. For Li’s method, Rahmani’s method, and kNN, we first use CCA [22] to project the multi-source data into a feature-homogeneous space and then apply these methods to retrieve the most likely outliers. For MMOD, we tune the regularization parameters over the set \(\{10^i|i=-2,-1,0,1,2\}\). For Li’s method and Rahmani’s method, the experimental settings follow the original works [13, 18]. The parameter k in kNN is selected from the set \(\{2*i+1|i=5,10,15,20,25\}\).

Fig. 3. Comparisons of outlier detection approaches

From Fig. 3, we can see that MMOD achieves significant gains and detects almost all the outliers. This observation indicates that MMOD is more favorable for detecting multi-source heterogeneous outliers because it fully takes into account the information compatibility and semantic complementarity among different sources.

3.3 Comparison in Different Outlier Rates

To test the performance of the proposed MMOD under different outlier rates, we further compare the recognition rate of MMOD with the other multi-view outlier detection methods, i.e., Li’s method [21], Zhao’s method [27], Janeja’s method [10], and Liu’s method [14], on the larger MIR Flickr dataset. We vary the outlier rate over the set \(\{10\%, 15\%, 20\%, 25\%\}\).

Fig. 4. Comparison in different outlier rates

We can see from Fig. 4 that MMOD is superior to the other multi-view outlier detection methods in recognition rate, which further confirms that MMOD can effectively identify multi-source outliers. Nevertheless, as the outlier rate increases, the performance of MMOD degrades. Thus, MMOD has the limitation that it needs a certain number of normal samples to identify multi-source outliers.

4 Conclusion

In this paper, we have investigated the heterogeneous outlier detection problem in multi-source learning. We developed an MMOD framework based on consistent representations for multi-source heterogeneous data. Within this framework, manifold learning is integrated to obtain a shared-representation space, in which information-correlated representations are close along the manifold while semantic-complementary instances are close in Euclidean distance. Meanwhile, an affine subspace is learned through affine combinations of shared representations from different sources in the feature-homogeneous space, according to the information compatibility among different sources. Finally, multi-source heterogeneous outliers can be effectively identified in the affine subspace.