Abstract
Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are independent and identically distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities may be interdependent and/or drawn from different probability distributions (non-IID). As a result, these methods can fail to detect important outliers that are too subtle to identify without considering the non-IID nature of the data. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two of its instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10–28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling the subsequent outlier detection of two different existing detectors.
Notes
Excessive thirst, weight loss and frequent urination are abnormal concurrent symptoms in diagnosing type-2 diabetes according to https://www.diabetesaustralia.com.au/.
We have ignored features with \( freq (m)=1\), as those features contain no useful information relevant to outlier detection.
\(\delta \) is normalized into the range (0,1) to work well with \(\eta \).
The source codes of CBRW/SDRW-based outlier detection (or feature selection) algorithms are made publicly available at https://sites.google.com/site/gspangsite/sourcecode.
Since CompreX was implemented in a different programming language from the other methods, its runtime is not directly comparable to theirs. Instead, we compare the methods in terms of the runtime ratio, i.e., the runtime on a larger/higher-dimensional data set divided by that on a smaller/lower-dimensional data set. Since the data size and the dimensionality increase factor are fixed, the runtime ratio is comparable across methods implemented in different programming languages.
References
Aggarwal CC (2017a) Outlier detection in categorical, text, and mixed attribute data. In: Outlier analysis, pp 249–272. Springer, Berlin
Aggarwal CC (2017b) Outlier analysis, second edn. Springer, Berlin
Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, pp 415–424. ACM
Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Disc 29(3):626–688
Andersen R, Chellapilla K (2009) Finding dense subgraphs with size bounds. In: Algorithms and models for the web-graph, pp 25–37
Angiulli F, Palopoli L et al (2008) Outlier detection using default reasoning. Artif Intell 172(16–17):1837–1872
Angiulli F, Fassetti F, Palopoli L (2009) Detecting outlying properties of exceptional objects. ACM Trans Datab Syst 34(1):7
Angiulli F, Ben-Eliyahu-Zohary R, Palopoli L (2010) Outlier detection for simple default theories. Artif Intell 174(15):1247–1253
Azmandian F, Yilmazer A, Dy JG, Aslam J, Kaeli DR, et al (2012) GPU-accelerated feature selection for outlier detection using the local kernel density ratio. In: ICDM, pp 51–60. IEEE
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: SDM, pp 243–254. SIAM
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. ACM SIGMOD Record 29(2):93–104
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. ACM SIGMOD Record 26(2):265–276
Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891–927
Cao L (2014) Non-iidness learning in behavioral and social data. Comput J 57(9):1358–1370
Cao L (2015) Coupling learning of complex interactions. Inf Process Manag 51(2):167–186
Cao L (2018) Data science thinking: the next scientific, technological and economic revolution. Springer, Berlin
Cao L, Yuming O, Philip SY (2012) Coupled behavior analysis with applications. IEEE Trans Knowl Data Eng 24(8):1378–1392
Cao L, Dong X, Zheng Z (2016) e-nsp: Efficient negative sequential pattern mining. Artif Intell 235:156–182
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15
Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C (2011) Polonium: Tera-scale graph mining and inference for malware detection. In: SDM, pp 131–142. SIAM
Das K, Schneider J (2007) Detecting anomalous records in categorical datasets. In: KDD, pp 220–229. ACM
Diaconis P, Stroock D (1991) Geometric bounds for eigenvalues of Markov chains. Ann Appl Probab 1(1):36–61
Emmott AF, Das S, Dietterich T, Fern A, Wong W-K (2013) Systematic construction of anomaly detection benchmarks from real data. In: KDD workshop, pp 16–21. ACM
Fan X, Xu RYD, Cao L (2016) Copula mixed-membership stochastic blockmodel. In: IJCAI, pp 1462–1468
Fill JA (1991) Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. Ann Appl Probab 1(1):62–87
Fowler JH, Christakis NA (2008) Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ 337:a2338
Ganiz MC, George C, Pottenger WM (2011) Higher order naive bayes: a novel non-iid approach to text classification. IEEE Trans Knowl Data Eng 23(7):1022–1034
Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal, pp 1–12
Gómez-Gardeñes J, Latora V (2008) Entropy rate of diffusion processes on complex networks. Phys Rev E 78(6):065102
Guha S, Mishra N, Roy G, Schrijvers O (2016) Robust random cut forest based anomaly detection on streams. In: ICML, pp 2712–2721
Gupta M, Gao J, Aggarwal C, Han J (2014) Outlier detection for temporal data. Synth Lect Data Min Knowl Discov 5(1):1–129
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
He J (2017) Learning from data heterogeneity: algorithms and applications. In: IJCAI, pp 5126–5130
He J, Carbonell J (2010) Coselection of features and instances for unsupervised rare category analysis. Stat Anal Data Min 3(6):417–430
He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300
Ienco D, Pensa RG, Meo R (2017) A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans Neural Netw Learn Syst 28(5):1017–1029
Jian S, Cao L, Pang G, Lu K, Gao H (2017) Embedding-based representation of categorical data by hierarchical value coupling learning. In: IJCAI, pp 1937–1943
Khuller S, Saha B (2009) On finding dense subgraphs. In: Automata, languages and programming, pp 597–608
Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Disc 20(2):259–289
Koufakou A, Secretan J, Georgiopoulos M (2011) Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl Inf Syst 29(3):697–725
Koutra D, Ke T-Y, Kang U, Chau D, Pao H-K, Faloutsos C (2011) Unifying guilt-by-association approaches: theorems and fast algorithms. In: Machine learning and knowledge discovery in databases, pp 245–260
Leyva E, González A, Perez R (2015) A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans Knowl Data Eng 27(2):354–367
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. CoRR arXiv:1601.07996
Liang J, Parthasarathy S (2016) Robust contextual outlier detection: Where context meets sparsity. In: Proceedings of the 25th ACM international on conference on information and knowledge management, pp 2167–2172. ACM
Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39
Maldonado S, Weber R, Famili F (2014) Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Inf Sci 286:228–246
McGlohon M, Bay S, Anderle MG, Steier DM, Faloutsos C (2009) SNARE: a link analytic system for graph labeling and risk detection. In: KDD, pp 1265–1274. ACM
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Ann Rev Sociol 27(1):415–444
Meyer CD (2000) Matrix analysis and applied linear algebra. SIAM, Philadelphia
Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Min Knowl Disc 12(2–3):203–228
Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. In: WWW conference, pp 161–172
Pang G, Ting KM, Albrecht D (2015) LeSiNN: detecting anomalies by identifying least similar nearest neighbours. In: ICDM workshop, pp 623–630. IEEE
Pang G, Cao L, Chen L (2016) Outlier detection in complex categorical data by modelling the feature value couplings. In: IJCAI, pp 1902–1908
Pang G, Cao L, Chen L, Lian D, Liu H (2018) Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data. In: Thirty-second AAAI conference on artificial intelligence
Pang G, Shen C, Cao L, van den Hengel A (2020) Deep learning for anomaly detection: a review. arXiv preprint arXiv:2007.02500
Rayana S, Akoglu L (2016) Less is more: building selective anomaly ensembles. ACM Trans Knowl Discov Data 10(4):42
Rayana S, Zhong W, Akoglu L (2016) Sequential ensemble learning for outlier detection: a bias-variance perspective. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1167–1172. IEEE
Schubert E, Wojdanowski R, Zimek A, Kriegel H-P (2012) On evaluation of outlier rankings and outlier scores. In: Proceedings of the 2012 SIAM international conference on data mining, pp 1047–1058. SIAM
Smets K, Vreeken J (2011) The odd one out: identifying and characterising anomalies. In: SDM, pp 109–148. SIAM
Smith MR, Martinez T, Giraud-Carrier C (2014) An instance level analysis of data complexity. Mach Learn 95(2):225–256
Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: NIPS, pp 467–475
Sun Y, Han J (2012) Mining heterogeneous information networks: principles and methodologies. Synth Lect Data Min Knowl Discov 3(2):1–159
Tang G, Pei J, Bailey J, Dong G (2015) Mining multidimensional contextual outliers from categorical relational data. Intell Data Anal 19(5):1171–1192
Tang J, Gao H, Hu X, Liu H (2013) Exploiting homophily effect for trust prediction. In: WSDM, pp 53–62. ACM
Ting KM, Zhou GT, Liu FT, Tan SC (2013) Mass estimation. Mach Learn 90(1):127–160
Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91
Wong W-K, Moore A, Cooper G, Wagner M (2003) Bayesian network anomaly pattern detection for disease outbreaks. In: ICML, pp 808–815
Wu S, Wang S (2013) Information-theoretic outlier detection for large-scale categorical data. IEEE Trans Knowl Data Eng 25(3):589–602
Zhang Q, Cao L, Zhu C, Li Z, Sun J (2018) Coupledcf: learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering. In: IJCAI’2018, pp 3662–3668
Zheng G, Brantley SL, Lauvaux T, Li Z (2017) Contextual spatial outlier detection with metric learning. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 2161–2170. ACM
Zhou Z-H, Sun Y-Y, Li Y-F (2009) Multi-instance learning by treating instances as non-iid samples. In: ICML, pp 1249–1256. ACM
Zimek A, Campello RJGB, Sander J (2013) Ensembles for unsupervised outlier detection: challenges and research questions. ACM SIGKDD Explor Newsl 15(1):11–22
Acknowledgements
This work was partially supported by the Australian Research Council discovery grant (DP190101079) and ARC Future Fellowship grant (FT190100734).
Additional information
Responsible editor: Srinivasan Parthasarathy.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Proofs of the theorems
1.1 A.1 Proof of Theorem 1
Proof
If the graph \(\mathsf {G}\) is irreducible and aperiodic, then based on the Perron–Frobenius Theorem (Meyer 2000), the URWs on \(\mathsf {G}\) based on the adjacency matrix \(\mathbf {A}\) will converge to a unique probability vector. \(\square \)
Lemma 1
(Equivalence between BRWs and URWs) BRWs based on the adjacency matrix \(\mathbf {A}\) and the bias \(\delta \) are equivalent to URWs on a graph \(\mathsf {G}^{b}\) with an adjacency matrix \(\mathbf {B}\), in which \(\mathbf {B}(u,v)=\delta (u)\mathbf {A}(u,v)\delta (v)\).
This lemma holds iff the transition matrix \(\mathbf {T}\) of \(\mathsf {G}^{b}\) satisfies \(\mathbf {T} \equiv \mathbf {W}^{b}\). Since \(\mathbf {B}(u,v)=\delta (u)\mathbf {A}(u,v)\delta (v)\), we have
\[
\mathbf {T}(u,v)=\frac{\mathbf {B}(u,v)}{\sum _{w\in \mathcal {V}}\mathbf {B}(u,w)}=\frac{\delta (u)\mathbf {A}(u,v)\delta (v)}{\sum _{w\in \mathcal {V}}\delta (u)\mathbf {A}(u,w)\delta (w)}=\frac{\mathbf {A}(u,v)\delta (v)}{\sum _{w\in \mathcal {V}}\mathbf {A}(u,w)\delta (w)}=\mathbf {W}^{b}(u,v),
\]
which completes the proof of the lemma.
Since \(\delta \) is always positive, the inclusion of \(\delta \) into \(\mathbf {A}\) does not change the graph’s irreducibility and aperiodicity. Based on Lemma 1, \(\mathbf {B}\) and \(\mathbf {A}\) have the same irreducibility and aperiodicity. Therefore, if \(\mathsf {G}\) is irreducible and aperiodic, so is \(\mathsf {G}^{b}\). We therefore have \(\varvec{\pi }^{*}=\mathbf {W}^{b}\varvec{\pi }^{*}\). \(\square \)
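The cancellation of \(\delta (u)\) behind Lemma 1 can be checked numerically. The sketch below (pure Python; the 4-node graph and the bias values are made up for illustration, not taken from the paper) runs power iteration both on the biased transitions and on the row-normalized matrix \(\mathbf {B}\), and confirms the two stationary vectors coincide.

```python
# Sketch: biased random walks (BRWs) with per-node bias delta are equivalent
# to unbiased random walks (URWs) on the re-weighted graph
# B(u,v) = delta(u) * A(u,v) * delta(v). Toy graph and biases are assumed.

def row_normalize(M):
    return [[x / sum(row) for x in row] for row in M]

def power_iteration(T, iters=500):
    """Stationary vector of a row-stochastic transition matrix T."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[u] * T[u][v] for u in range(n)) for v in range(n)]
    return pi

# Toy symmetric adjacency matrix A and node biases delta (illustrative).
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
delta = [0.9, 0.2, 0.4, 0.7]
n = len(A)

# BRW transitions: W^b(u,v) = A(u,v)*delta(v) / sum_w A(u,w)*delta(w).
Wb = row_normalize([[A[u][v] * delta[v] for v in range(n)] for u in range(n)])

# URW transitions on B: delta(u) cancels in the row normalization,
# so T coincides with W^b (Lemma 1).
B = [[delta[u] * A[u][v] * delta[v] for v in range(n)] for u in range(n)]
T = row_normalize(B)

pi_brw = power_iteration(Wb)
pi_urw = power_iteration(T)
assert all(abs(a - b) < 1e-9 for a, b in zip(pi_brw, pi_urw))
print([round(p, 4) for p in pi_brw])
```

The toy graph contains triangles, so the chain is irreducible and aperiodic and the power iteration converges, matching the condition used in Theorem 1.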
1.2 A.2 Proof of Theorem 2
Proof
To prove Eq. (20), we need to show that when \(\varvec{\pi }^{\prime }(u) =\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\), \(\forall u \in \mathcal {V}\), we have \(\varvec{\pi }^{\prime }=\mathbf {W}^{b\prime }\varvec{\pi }^{\prime }\), i.e., \(\varvec{\pi }^{\prime }\) becomes steady w.r.t. the time step.
First, the probability of visiting v at step \(t+1\) is \(\varvec{\pi }^{\prime ,t+1}(v) = \sum _{u\in \mathcal {V}}\varvec{\pi }^{\prime ,t}(u)\mathbf {W}^{b\prime }(u,v)\), where \(\mathbf {W}^{b\prime }(u,v)=\frac{\mathbf {B}^{\prime }(u,v)}{d^{\prime }(u)}\). When \(\varvec{\pi }^{\prime ,t}(u) =\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\), we have
\[
\varvec{\pi }^{\prime ,t+1}(v) = \sum _{u\in \mathcal {V}}\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\cdot \frac{\mathbf {B}^{\prime }(u,v)}{d^{\prime }(u)} = \frac{1}{ vol (\mathsf {G}^{\prime })}\sum _{u\in \mathcal {V}}\mathbf {B}^{\prime }(u,v).
\]
Since \(\mathbf {B}^{\prime }(u,v)= \mathbf {B}^{\prime }(v,u)\), we further have \(\sum _{u\in \mathcal {V}}\mathbf {B}^{\prime }(u,v)=\sum _{u\in \mathcal {V}}\mathbf {B}^{\prime }(v,u)=d^{\prime }(v)\), and hence \(\varvec{\pi }^{\prime ,t+1}(v)=\frac{d^{\prime }(v)}{ vol (\mathsf {G}^{\prime })}\).
Therefore, we also have \(\varvec{\pi }^{\prime ,t+1}(u)=\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}=\varvec{\pi }^{\prime ,t}(u)\), i.e., \(\varvec{\pi }^{\prime }\) becomes steady. \(\square \)
1.3 A.3 Proof of Theorem 3
Proof
We show \( diff =\left( \delta (u^{\prime }) - \phi _{ freq }(u^{\prime })\right) - \left( \delta (v^{\prime }) - \phi _{ freq }(v^{\prime })\right) > 0\) to complete the proof.
First, we have
Let \(C = N - supp (m)\) and \(H = supp (m) + supp (u^{\prime })\). Then after some algebra, we have
Similarly, we can obtain
Let \( supp (v^{\prime }) = supp (m) - I\). Then after some algebra, we obtain
Therefore,
We always have \(N> supp (m)\). Moreover, as \(u^{\prime }\) and \(v^{\prime }\) are outlying and normal values respectively, we have \( supp (u^{\prime }) < supp (v^{\prime })\), and thus \(H > I\). Therefore, \(\left( \delta (u^{\prime }) - \phi _{ freq }(u^{\prime })\right) - \left( \delta (v^{\prime }) - \phi _{ freq }(v^{\prime })\right) >0\). \(\square \)
B Convergence and sensitivity test results of CBRW
The empirical convergence analysis for CBRW is provided in the first subsection, followed by the sensitivity test of CBRW w.r.t. the parameter \(\alpha \).
1.1 B.1 Convergence test
The convergence rate of random walks is governed by two key graph properties: the graph diameter and the Cheeger constant (Diaconis and Stroock 1991; Fill 1991). Computing the Cheeger constant is prohibitively expensive for large graphs, so we use clustering coefficients instead. The diameter and clustering coefficient of the value graph for each data set are presented in Table 8. Clearly, all the value graphs have small diameters and large clustering coefficients. This is because a value in one feature often co-occurs with most, if not all, of the values in other features, and values are linked as long as they co-occur, resulting in highly connected, dense value graphs. Fast convergence is expected for random walks on such graphs (Diaconis and Stroock 1991; Fill 1991).
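Both properties are straightforward to compute on small graphs. A minimal pure-Python sketch (the edge list is illustrative, not one of the paper's value graphs): the diameter is the largest BFS eccentricity, and the clustering coefficient is averaged over the per-node local coefficients.

```python
# Sketch: graph diameter via BFS from every node, and the average
# clustering coefficient via triangle counting. Toy edge list only.
from collections import deque

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def eccentricity(src):
    """Largest BFS distance from src (graph assumed connected)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return max(dist.values())

diameter = max(eccentricity(u) for u in adj)

def local_clustering(u):
    """Fraction of neighbour pairs of u that are themselves linked."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

avg_clustering = sum(local_clustering(u) for u in adj) / len(adj)
print(diameter, round(avg_clustering, 3))  # -> 2 0.767
```

On dense, highly clustered graphs like the value graphs above, the diameter stays small as the graph grows, which is exactly the regime in which fast random-walk convergence is expected.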
The convergence test results in Fig. 4 show that CBRW converges quickly on all 15 data sets, i.e., within 70 iterations. CBRW converges after about 10 iterations on 13 data sets, but takes about 70 iterations on Probe and U2R. This is because these two data sets, particularly Probe, contain a large proportion of feature values with frequencies of less than three. As a result, although the overall clustering coefficients of their value graphs are high, their Cheeger constants can be quite small, which leads to slower convergence.
1.2 B.2 Sensitivity test w.r.t. the damping factor \(\alpha \)
CBRW has only one parameter, the damping factor \(\alpha \). \(\alpha \) prevents the random walk from getting stuck in isolated nodes by offering a small restart probability \((1-\alpha )\), which guarantees the algorithm's convergence without affecting its effectiveness. \(\alpha =1.0\) is not recommended as it may break the convergence condition. Also, \(\alpha \) should be sufficiently large, e.g., \(\alpha \ge 0.85\); otherwise the underlying graph structure is largely ignored. Below we examine the sensitivity of CBRW w.r.t. \(\alpha \) over a wide range of values, [0.85, 0.99], by performing direct outlier detection (i.e., using \(\text {CBRW}_{\mathrm{od}}\)). Fig. 5 reports the AUC results w.r.t. \(\alpha \) on all 15 data sets.
The results show that CBRW performs very stably over a large range of settings on most data sets, and a large \(\alpha \) is preferable to a small one. This is because (i) \(\alpha \) is introduced to guarantee the convergence of the CBRW algorithm and is data-insensitive in terms of effectiveness, unlike data-sensitive parameters in other detectors, such as the minimum support in FPOF and the subsampling size in iForest; and (ii) the graph structure and edge weights are carefully designed to highlight outlying values, and a large \(\alpha \) is needed to exploit this design. The best performance on some data sets, e.g., U2R, APAS, w7a and AD, requires a large \(\alpha \): these data sets may contain highly noisy values, and a large \(\alpha \) widens the gap between the outlierness of outlying values and that of noisy values. On other data sets, like CT, a medium \(\alpha \) works best, possibly because some of their outlying values cannot accumulate sufficient outlierness through the original graph structure and instead rely on outlierness propagated via the restart probabilities. We therefore recommend a relatively large \(\alpha \) (e.g., \(\alpha =0.95\)) to balance both cases.
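The role of \(\alpha \) can be illustrated with the standard damped (PageRank-style) update; the 3-node transition matrix and the uniform restart vector below are assumptions for illustration, not the matrices used by CBRW.

```python
# Sketch of a damped random walk: pi <- alpha * pi W + (1 - alpha) * r.
# Any alpha < 1 guarantees convergence; alpha = 1 drops the restart term
# and may not. Toy row-stochastic matrix W and uniform restart r assumed.

def damped_walk(W, alpha, iters=1000):
    n = len(W)
    r = [1.0 / n] * n                     # uniform restart probabilities
    pi = list(r)
    for _ in range(iters):
        pi = [(1 - alpha) * r[v]
              + alpha * sum(pi[u] * W[u][v] for u in range(n))
              for v in range(n)]
    return pi

W = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.1, 0.9, 0.0]]

for alpha in (0.85, 0.95, 0.99):
    pi = damped_walk(W, alpha)
    print(alpha, [round(p, 4) for p in pi])
```

Increasing \(\alpha \) shifts the stationary vector toward the structure encoded in \(W\) and away from the uniform restart vector, which is why a large \(\alpha \) lets the carefully designed graph structure dominate the scores.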
C Key outputs of CBRW and SDRW
We provide several key outputs of CBRW to enable an in-depth understanding of its algorithmic procedure. All these outputs are built upon the toy dataset in Table 1. Table 9 presents the transition matrix used by the BRWs in CBRW, i.e., \(\mathbf {W}^b\) in Eq. (8), in which each entry \(\mathbf {W}^b(u, v)\) is determined by the inter-feature outlierness influence \(\eta \) and the intra-feature outlier factor \(\delta \). Given \(\mathbf {W}^b\), the power iteration method is used to perform the random walks and obtain a stationary probability vector \(\varvec{\pi }^{*}\). In CBRW, these stationary probabilities serve as the outlierness of the values in the value-value graph according to Eq. (10). Table 10 shows the outlierness of each value of the toy dataset, with each outlierness corresponding to one entry of \(\varvec{\pi }^{*}\). The outlierness of data objects is then calculated via Eq. (29) based on the value outlierness. Finally, CBRW produces the outlierness of all data objects, as in Table 11. The first data object, the only genuine outlier, is assigned a larger outlierness than all other data objects, including the noisy data object #10.
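For readers who want to trace this value-graph-to-object-score procedure programmatically, the following is a highly simplified sketch, not the paper's implementation: the outlier factor \(\delta \), the influence \(\eta \), the damping and the object scoring are all stand-in proxies rather than the paper's Eqs. (8), (10) and (29), and the toy data below differs from Table 1.

```python
# Simplified CBRW-style pipeline on a tiny categorical dataset. Proxies:
# delta(v) = 1 - freq(v)/N, eta(u,v) ~ P(v | u) from co-occurrence,
# damped power iteration, object score = mean outlierness of its values.
from collections import Counter

data = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "x"), ("b", "y")]
N, n_feat = len(data), len(data[0])

# Feature values as graph nodes; keep the feature index to avoid clashes.
values = sorted({(j, row[j]) for row in data for j in range(n_feat)})
idx = {v: i for i, v in enumerate(values)}
freq = Counter((j, row[j]) for row in data for j in range(n_feat))
delta = [1.0 - freq[v] / N for v in values]   # intra-feature factor (proxy)
n = len(values)

# Co-occurrence counts across features -> biased transitions:
# W[u][v] proportional to P(v | u) * delta(v), then row-normalized.
co = [[0] * n for _ in range(n)]
for row in data:
    cells = [idx[(j, row[j])] for j in range(n_feat)]
    for u in cells:
        for v in cells:
            if u != v:
                co[u][v] += 1
W = [[co[u][v] / freq[values[u]] * delta[v] for v in range(n)]
     for u in range(n)]
W = [[x / sum(row) for x in row] for row in W]  # every row is non-empty here

# Damped power iteration -> stationary vector = value outlierness.
alpha = 0.95
pi = [1.0 / n] * n
for _ in range(300):
    walk = [sum(pi[u] * W[u][v] for u in range(n)) for v in range(n)]
    pi = [(1 - alpha) / n + alpha * w for w in walk]

# Object outlierness: mean outlierness of the object's values (stand-in).
scores = [sum(pi[idx[(j, row[j])]] for j in range(n_feat)) / n_feat
          for row in data]
print([round(s, 4) for s in scores])  # the rare object ("b","y") scores highest
```

Even with these crude proxies, the rare values ("b", "y") accumulate higher stationary probabilities than the frequent ones, and the object composed of them receives the top score, mirroring the behaviour illustrated in Tables 9–11.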
We also provide similar outputs for SDRW. Table 12 presents the adjacency matrix used by SDRW, i.e., \(\mathbf {C}\) in Eq. (13). Note that the value graph in SDRW is undirected, so we have \(\mathbf {C}(u,v)=\mathbf {C}(v,u)\). We then incorporate the subgraph density-based outlier factor into the matrix and calculate the value outlierness using the closed-form solution in Eq. (21). The resulting value outlierness is shown in Table 13. SDRW finally uses the same object outlierness calculation as CBRW, i.e., Eq. (29), to obtain the object-level outlier scores. As shown in Table 14, SDRW also easily identifies the outliers in the toy dataset.
It should be noted that the above results are built upon a simple synthetic toy dataset to demonstrate the procedure of our methods; they do not imply any bias or discrimination issues in the application of our methods to real-world datasets.
Cite this article
Pang, G., Cao, L. & Chen, L. Homophily outlier detection in non-IID categorical data. Data Min Knowl Disc 35, 1163–1224 (2021). https://doi.org/10.1007/s10618-021-00750-y