Homophily outlier detection in non-IID categorical data

Abstract

Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are independent and identically distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities may depend on each other and/or be drawn from different probability distributions (non-IID). This can lead to a failure to detect important outliers that are too subtle to be identified without considering the non-IID nature of the data. The issue is further intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process on the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10–28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling the subsequent outlier detection of two different existing detectors.

Notes

  1. Excessive thirst, weight loss and frequent urination are abnormal concurrent symptoms used in diagnosing type-2 diabetes according to https://www.diabetesaustralia.com.au/.

  2. We have ignored features with \( freq (m)=1\), as those features contain no useful information relevant to outlier detection.

  3. \(\delta \) is normalized into the range (0, 1) so that it works well with \(\eta \).

  4. The source code of the CBRW/SDRW-based outlier detection (and feature selection) algorithms is publicly available at https://sites.google.com/site/gspangsite/sourcecode.

  5. Since CompreX was implemented in a different programming language from the other methods, their runtimes are not directly comparable. Instead, for a fairer comparison, we compare the methods in terms of the runtime ratio, i.e., the runtime on a larger/higher-dimensional data set divided by that on a smaller/lower-dimensional data set. Since the data size and the factor by which the dimensionality increases are fixed, the runtime ratio is comparable across methods implemented in different programming languages.

References

  • Aggarwal CC (2017a) Outlier detection in categorical, text, and mixed attribute data. In: Outlier analysis, pp 249–272. Springer, Berlin

  • Aggarwal CC (2017b) Outlier analysis, second edn. Springer, Berlin

  • Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, pp 415–424. ACM

  • Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Disc 29(3):626–688

  • Andersen R, Chellapilla K (2009) Finding dense subgraphs with size bounds. In: Algorithms and models for the web-graph, pp 25–37

  • Angiulli F, Palopoli L et al (2008) Outlier detection using default reasoning. Artif Intell 172(16–17):1837–1872

  • Angiulli F, Fassetti F, Palopoli L (2009) Detecting outlying properties of exceptional objects. ACM Trans Datab Syst 34(1):7

  • Angiulli F, Ben-Eliyahu-Zohary R, Palopoli L (2010) Outlier detection for simple default theories. Artif Intell 174(15):1247–1253

  • Azmandian F, Yilmazer A, Dy JG, Aslam J, Kaeli DR, et al (2012) GPU-accelerated feature selection for outlier detection using the local kernel density ratio. In ICDM, pp 51–60. IEEE

  • Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: SDM, pp 243–254. SIAM

  • Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. ACM SIGMOD Record 29(2):93–104

  • Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. ACM SIGMOD Record 26(2):265–276

  • Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891–927

  • Cao L (2014) Non-iidness learning in behavioral and social data. Comput J 57(9):1358–1370

  • Cao L (2015) Coupling learning of complex interactions. Inf Process Manag 51(2):167–186

  • Cao L (2018) Data science thinking: the next scientific, technological and economic revolution. Springer, Berlin

  • Cao L, Ou Y, Yu PS (2012) Coupled behavior analysis with applications. IEEE Trans Knowl Data Eng 24(8):1378–1392

  • Cao L, Dong X, Zheng Z (2016) e-nsp: Efficient negative sequential pattern mining. Artif Intell 235:156–182

  • Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15

  • Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C (2011) Polonium: Tera-scale graph mining and inference for malware detection. In: SDM, pp 131–142. SIAM

  • Das K, Schneider J (2007) Detecting anomalous records in categorical datasets. In: KDD, pp 220–229. ACM

  • Diaconis P, Stroock D (1991) Geometric bounds for eigenvalues of markov chains. Ann Appl Probab 1(1):36–61

  • Emmott AF, Das S, Dietterich T, Fern A, Wong W-K (2013) Systematic construction of anomaly detection benchmarks from real data. In: KDD workshop, pp 16–21. ACM

  • Fan X, Xu RYD, Cao L (2016) Copula mixed-membership stochastic blockmodel. In: IJCAI, pp 1462–1468

  • Fill JA (1991) Eigenvalue bounds on convergence to stationarity for nonreversible markov chains, with an application to the exclusion process. Ann Appl Probab 1(1):62–87

  • Fowler JH, Christakis NA (2008) Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study. BMJ 337:a2338

  • Ganiz MC, George C, Pottenger WM (2011) Higher order naive bayes: a novel non-iid approach to text classification. IEEE Trans Knowl Data Eng 23(7):1022–1034

  • Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal, pp 1–12

  • Gómez-Gardeñes J, Latora V (2008) Entropy rate of diffusion processes on complex networks. Phys Rev E 78(6):065102

  • Guha S, Mishra N, Roy G, Schrijvers O (2016) Robust random cut forest based anomaly detection on streams. In: ICML, pp 2712–2721

  • Gupta M, Gao J, Aggarwal C, Han J (2014) Outlier detection for temporal data. Synth Lect Data Min Knowl Discov 5(1):1–129

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18

  • Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186

  • He J (2017) Learning from data heterogeneity: algorithms and applications. In: IJCAI, pp 5126–5130

  • He J, Carbonell J (2010) Coselection of features and instances for unsupervised rare category analysis. Stat Anal Data Min 3(6):417–430

  • He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118

  • Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300

  • Ienco D, Pensa RG, Meo R (2017) A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans Neural Netw Learn Syst 28(5):1017–1029

  • Jian S, Cao L, Pang G, Lu K, Gao H (2017) Embedding-based representation of categorical data by hierarchical value coupling learning. In: IJCAI, pp 1937–1943

  • Khuller S, Saha B (2009) On finding dense subgraphs. In: Automata, languages and programming, pp 597–608

  • Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Disc 20(2):259–289

  • Koufakou A, Secretan J, Georgiopoulos M (2011) Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl Inf Syst 29(3):697–725

  • Koutra D, Ke T-Y, Kang U, Chau D, Pao H-K, Faloutsos C (2011) Unifying guilt-by-association approaches: theorems and fast algorithms. In: Machine learning and knowledge discovery in databases, pp 245–260

  • Leyva E, González A, Perez R (2015) A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans Knowl Data Eng 27(2):354–367

  • Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. arXiv preprint arXiv:1601.07996

  • Liang J, Parthasarathy S (2016) Robust contextual outlier detection: Where context meets sparsity. In: Proceedings of the 25th ACM international on conference on information and knowledge management, pp 2167–2172. ACM

  • Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39

  • Maldonado S, Weber R, Famili F (2014) Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Inf Sci 286:228–246

  • McGlohon M, Bay S, Anderle MG, Steier DM, Faloutsos C (2009) SNARE: a link analytic system for graph labeling and risk detection. In: KDD, pp 1265–1274. ACM

  • McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Ann Rev Sociol 27(1):415–444

  • Meyer CD (2000) Matrix analysis and applied linear algebra. SIAM, Philadelphia

  • Otey ME, Ghoting A, Parthasarathy S (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Min Knowl Disc 12(2–3):203–228

  • Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. In: WWW conference, pp 161–172

  • Pang G, Ting KM, Albrecht D (2015) LeSiNN: detecting anomalies by identifying least similar nearest neighbours. In: ICDM workshop, pp 623–630. IEEE

  • Pang G, Cao L, Chen L (2016) Outlier detection in complex categorical data by modelling the feature value couplings. In IJCAI, pp 1902–1908

  • Pang G, Cao L, Chen L, Lian D, Liu H (2018) Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data. In: Thirty-second AAAI conference on artificial intelligence

  • Pang G, Shen C, Cao L, van den Hengel A (2020) Deep learning for anomaly detection: a review. arXiv preprint arXiv:2007.02500

  • Rayana S, Akoglu L (2016) Less is more: building selective anomaly ensembles. ACM Trans Knowl Discov Data 10(4):42

  • Rayana S, Zhong W, Akoglu L (2016) Sequential ensemble learning for outlier detection: a bias-variance perspective. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1167–1172. IEEE

  • Schubert E, Wojdanowski R, Zimek A, Kriegel H-P (2012) On evaluation of outlier rankings and outlier scores. In: Proceedings of the 2012 SIAM international conference on data mining, pp 1047–1058. SIAM

  • Smets K, Vreeken J (2011) The odd one out: identifying and characterising anomalies. In: SDM, pp 109–148. SIAM

  • Smith MR, Martinez T, Giraud-Carrier C (2014) An instance level analysis of data complexity. Mach Learn 95(2):225–256

  • Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: NIPS, pp 467–475

  • Sun Y, Han J (2012) Mining heterogeneous information networks: principles and methodologies. Synth Lect Data Min Knowl Discov 3(2):1–159

  • Tang G, Pei J, Bailey J, Dong G (2015) Mining multidimensional contextual outliers from categorical relational data. Intell Data Anal 19(5):1171–1192

  • Tang J, Gao H, Hu X, Liu H (2013) Exploiting homophily effect for trust prediction. In: WSDM, pp 53–62. ACM

  • Ting KM, Zhou GT, Liu FT, Tan SC (2013) Mass estimation. Mach Learn 90(1):127–160

  • Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91

  • Wong W-K, Moore A, Cooper G, Wagner M (2003) Bayesian network anomaly pattern detection for disease outbreaks. In: ICML, pp 808–815

  • Wu S, Wang S (2013) Information-theoretic outlier detection for large-scale categorical data. IEEE Trans Knowl Data Eng 25(3):589–602

  • Zhang Q, Cao L, Zhu C, Li Z, Sun J (2018) Coupledcf: learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering. In: IJCAI’2018, pp 3662–3668

  • Zheng G, Brantley SL, Lauvaux T, Li Z (2017) Contextual spatial outlier detection with metric learning. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 2161–2170. ACM

  • Zhou Z-H, Sun Y-Y, Li Y-F (2009) Multi-instance learning by treating instances as non-iid samples. In: ICML, pp 1249–1256. ACM

  • Zimek A, Campello RJGB, Sander J (2013) Ensembles for unsupervised outlier detection: challenges and research questions. ACM SIGKDD Explor Newsl 15(1):11–22


Acknowledgements

This work was partially supported by the Australian Research Council discovery grant (DP190101079) and ARC Future Fellowship grant (FT190100734).

Author information

Corresponding author

Correspondence to Longbing Cao.

Additional information

Responsible editor: Srinivasan Parthasarathy.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs of the theorems

A.1 Proof of Theorem 1

Proof

If the graph \(\mathsf {G}\) is irreducible and aperiodic, then, by the Perron–Frobenius theorem (Meyer 2000), the URWs on \(\mathsf {G}\) based on the adjacency matrix \(\mathbf {A}\) converge to a unique probability vector. \(\square \)

Lemma 1

(Equivalence between BRWs and URWs) BRWs based on the adjacency matrix \(\mathbf {A}\) and the bias \(\delta \) are equivalent to URWs on a graph \(\mathsf {G}^{b}\) with an adjacency matrix \(\mathbf {B}\), in which

$$\begin{aligned} \mathbf {B}(u,v)=\delta (u)\mathbf {A}(u,v)\delta (v), \; \forall u,v \in \mathcal {V}. \end{aligned}$$
(35)

This lemma holds iff the transition matrix \(\mathbf {T}\) of \(\mathsf {G}^{b}\) satisfies: \(\mathbf {T} \equiv \mathbf {W}^{b}\). Since \(\mathbf {B}(u,v)=\delta (u)\mathbf {A}(u,v)\delta (v)\), we have

$$\begin{aligned} \mathbf {T}(u,v)&=\frac{\mathbf {B}(u,v)}{\sum _{v \in \mathcal {V}}\mathbf {B}(u,v)}= \frac{\delta (u)\mathbf {A}(u,v)\delta (v)}{\sum _{v\in V}\delta (u)\mathbf {A}(u,v)\delta (v)} \\&=\frac{\mathbf {A}(u,v)\delta (v)}{\sum _{v\in V}\mathbf {A}(u,v)\delta (v)}=\mathbf {W}^{b}(u,v), \end{aligned}$$

which completes the proof of the lemma.

Since \(\delta \) is always positive, the inclusion of \(\delta \) into \(\mathbf {A}\) does not change the graph’s irreducibility and aperiodicity. Based on Lemma 1, \(\mathbf {B}\) and \(\mathbf {A}\) have the same irreducibility and aperiodicity. Therefore, if \(\mathsf {G}\) is irreducible and aperiodic, so is \(\mathsf {G}^{b}\). We therefore have \(\varvec{\pi }^{*}=\mathbf {W}^{b}\varvec{\pi }^{*}\). \(\square \)
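As a minimal numerical sketch of Lemma 1 (the adjacency matrix and bias values below are hypothetical), one can check that row-normalizing \(\mathbf {B}(u,v)=\delta (u)\mathbf {A}(u,v)\delta (v)\) recovers the biased transition matrix \(\mathbf {W}^{b}(u,v)=\mathbf {A}(u,v)\delta (v)/\sum _{w}\mathbf {A}(u,w)\delta (w)\):

```python
import numpy as np

# Hypothetical symmetric adjacency matrix A and positive bias vector delta
A = np.array([[0., 1., 1.],
              [1., 0., 2.],
              [1., 2., 0.]])
delta = np.array([0.2, 0.5, 0.3])

# Biased transition matrix: W^b(u,v) = A(u,v)*delta(v) / sum_w A(u,w)*delta(w)
W_b = A * delta
W_b = W_b / W_b.sum(axis=1, keepdims=True)

# Unbiased walk on B(u,v) = delta(u)*A(u,v)*delta(v): row-normalize B
B = np.outer(delta, delta) * A
T = B / B.sum(axis=1, keepdims=True)

# The two transition matrices coincide, as stated in Lemma 1
assert np.allclose(W_b, T)
print(W_b)
```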

A.2 Proof of Theorem 2

Proof

To prove Eq. (20), we need to show that when \(\varvec{\pi }^{\prime }(u) =\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\), \(\forall u \in \mathcal {V}\), we have \(\varvec{\pi }^{\prime }=\mathbf {W}^{b\prime }\varvec{\pi }^{\prime }\), i.e., \(\varvec{\pi }^{\prime }\) becomes steady w.r.t. the time step.

First, the probability of visiting v is \(\varvec{\pi }^{\prime ,t+1}(v) = \sum _{u\in \mathcal {V}}\varvec{\pi }^{\prime ,t}(u)\mathbf {W}^{b\prime }(u,v)\). We then have

$$\begin{aligned} \varvec{\pi }^{\prime ,t+1}(v) = \sum _{u\in \mathcal {V}}\varvec{\pi }^{\prime ,t}(u)\frac{\mathbf {B}^{\prime }(u,v)}{\sum _{w \in \mathcal {V}}\mathbf {B}^{\prime }(u,w)}. \end{aligned}$$

When \(\varvec{\pi }^{\prime ,t}(u) =\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\), we have

$$\begin{aligned} \varvec{\pi }^{\prime ,t+1}(v) =\sum _{u\in \mathcal {V}}\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\frac{\mathbf {B}^{\prime }(u,v)}{\sum _{w \in \mathcal {V}}\mathbf {B}^{\prime }(u,w)}= \sum _{u\in \mathcal {V}}\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}\frac{\mathbf {B}^{\prime }(u,v)}{d^{\prime }(u)}=\sum _{u\in \mathcal {V}}\frac{\mathbf {B}^{\prime }(u,v)}{ vol (\mathsf {G}^{\prime })}. \end{aligned}$$

Since \(\mathbf {B}^{\prime }(u,v)= \mathbf {B}^{\prime }(v,u)\), we further have

$$\begin{aligned} \varvec{\pi }^{\prime ,t+1}(v) =\sum _{u\in \mathcal {V}}\frac{\mathbf {B}^{\prime }(u,v)}{ vol (\mathsf {G}^{\prime })}=\sum _{u\in \mathcal {V}}\frac{\mathbf {B}^{\prime }(v,u)}{ vol (\mathsf {G}^{\prime })}= \frac{d^{\prime }(v)}{ vol (\mathsf {G}^{\prime })}. \end{aligned}$$

Therefore, we also have \(\varvec{\pi }^{\prime ,t+1}(u)=\frac{d^{\prime }(u)}{ vol (\mathsf {G}^{\prime })}=\varvec{\pi }^{\prime ,t}(u)\), i.e., \(\varvec{\pi }^{\prime }\) becomes steady. \(\square \)
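A quick numerical check of Theorem 2 (with a hypothetical symmetric weighted adjacency matrix \(\mathbf {B}^{\prime }\)) confirms that the degree-proportional vector \(d^{\prime }(u)/ vol (\mathsf {G}^{\prime })\) is left unchanged by one random-walk step:

```python
import numpy as np

# Hypothetical symmetric (undirected) weighted adjacency matrix B'
B = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

d = B.sum(axis=1)        # weighted degrees d'(u)
vol = d.sum()            # graph volume vol(G')
pi = d / vol             # candidate stationary vector d'(u)/vol(G')

W = B / d[:, None]       # row-stochastic transition matrix W^{b'}

# One step of the walk leaves pi unchanged, i.e., pi is stationary (Theorem 2)
assert np.allclose(pi @ W, pi)
print(pi)
```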

A.3 Proof of Theorem 3

Proof

We show \( diff =\left( \delta (u^{\prime }) - \phi _{ freq }(u^{\prime })\right) - \left( \delta (v^{\prime }) - \phi _{ freq }(v^{\prime })\right) > 0\) to complete the proof.

First, we have

$$\begin{aligned} \delta (u^{\prime }) - \phi _{ freq }(u^{\prime })&= \left( 1 - freq (m) + \frac{ freq (m) - freq (u^{\prime })}{ freq (m)}\right) - \left( 1 - freq (u^{\prime })\right) \\&= \frac{ supp (m)N- supp (u^{\prime })N + supp (m) supp (u^{\prime })- supp (m)^{2}}{ supp (m)N}. \end{aligned}$$

Let \(C = N - supp (m)\) and \(H = supp (m) - supp (u^{\prime })\). Then after some algebra, we have

$$\begin{aligned} \delta (u^{\prime }) - \phi _{ freq }(u^{\prime }) = \frac{NH- supp (m)H}{ supp (m)N}. \end{aligned}$$

Similarly, we can obtain

$$\begin{aligned} \delta (v^{\prime }) - \phi _{ freq }(v^{\prime }) = \frac{ supp (m)N- supp (m)^{2}- supp (v^{\prime })N+ supp (v^{\prime }) supp (m)}{ supp (m)N}. \end{aligned}$$

Let \( supp (v^{\prime }) = supp (m) - I\). Then after some algebra, we obtain

$$\begin{aligned} \delta (v^{\prime }) - \phi _{ freq }(v^{\prime }) = \frac{NI- supp (m)I}{ supp (m)N}. \end{aligned}$$

Therefore,

$$\begin{aligned} diff&= \frac{NH- supp (m)H}{ supp (m)N} - \frac{NI- supp (m)I}{ supp (m)N} \\&= \frac{(N- supp (m))(H- I)}{ supp (m)N}. \end{aligned}$$

We always have \(N> supp (m)\). Moreover, as \(u^{\prime }\) and \(v^{\prime }\) are outlying and normal values respectively, we have \( supp (u^{\prime }) < supp (v^{\prime })\), and thus \(H > I\). Therefore, \(\left( \delta (u^{\prime }) - \phi _{ freq }(u^{\prime })\right) - \left( \delta (v^{\prime }) - \phi _{ freq }(v^{\prime })\right) >0\). \(\square \)
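The inequality can be checked numerically using the definitions of \(\delta \) and \(\phi _{ freq }\) as written in the first step of the proof and hypothetical counts \(N\), \( supp (m)\), \( supp (u^{\prime })\) and \( supp (v^{\prime })\) with \( supp (u^{\prime })< supp (v^{\prime })< supp (m)\):

```python
import numpy as np

# Hypothetical counts: N objects, mode support supp_m, and supports of an
# outlying value u' and a normal value v' with supp_u < supp_v < supp_m
N, supp_m, supp_u, supp_v = 100, 60, 5, 40

def delta(supp_val):
    # delta = 1 - freq(m) + (freq(m) - freq(val)) / freq(m), as in the proof
    freq_m, freq_val = supp_m / N, supp_val / N
    return 1 - freq_m + (freq_m - freq_val) / freq_m

def phi_freq(supp_val):
    # Frequency-based outlier factor: phi_freq = 1 - freq(val)
    return 1 - supp_val / N

diff = (delta(supp_u) - phi_freq(supp_u)) - (delta(supp_v) - phi_freq(supp_v))

# Closed form from the proof: (N - supp(m)) * (H - I) / (supp(m) * N)
H, I = supp_m - supp_u, supp_m - supp_v
closed_form = (N - supp_m) * (H - I) / (supp_m * N)

assert np.isclose(diff, closed_form) and diff > 0
print(diff)
```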

B Convergence and sensitivity test results of CBRW

The empirical convergence analysis for CBRW is provided in the first subsection, followed by the sensitivity test of CBRW w.r.t. the parameter \(\alpha \).

Table 8 Two key properties of a value graph
Fig. 4 Convergence test results

B.1 Convergence test

The convergence rate of random walks is governed by two key graph properties: the graph diameter and the Cheeger constant (Diaconis and Stroock 1991; Fill 1991). The runtime for computing the Cheeger constant is prohibitive for large graphs, so we replace this constant with the clustering coefficient. The graph diameter and clustering coefficient of the value graph for each data set are presented in Table 8. It is clear that all the value graphs have small diameters and large clustering coefficients. This is because a value in one feature often co-occurs with most, if not all, of the values in other features. Moreover, linkages exist between values as long as they co-occur, resulting in a highly connected, dense value graph. Fast convergence rates are expected for random walks on such graphs (Diaconis and Stroock 1991; Fill 1991).
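As a rough illustration of how these two properties can be computed, the sketch below builds a toy value graph (nodes are feature values, with an edge whenever two values co-occur in an object; the paper's actual edge weighting is not reproduced here) and reports its diameter and average clustering coefficient, assuming networkx is available.

```python
import itertools
import networkx as nx

# Toy categorical data: each object is a list of (feature, value) pairs
objects = [
    [("f1", "a"), ("f2", "x"), ("f3", "p")],
    [("f1", "a"), ("f2", "y"), ("f3", "p")],
    [("f1", "b"), ("f2", "x"), ("f3", "q")],
]

# Unweighted co-occurrence value graph (a simplification of the paper's value graph)
G = nx.Graph()
for obj in objects:
    for u, v in itertools.combinations(obj, 2):
        G.add_edge(u, v)

# The two properties reported in Table 8
print("diameter:", nx.diameter(G))                        # longest shortest path
print("clustering coefficient:", nx.average_clustering(G))
```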

The convergence test results in Fig. 4 show that CBRW converges quickly on all 15 data sets, i.e., within 70 iterations. CBRW converges after about 10 iterations on 13 data sets, but takes about 70 iterations to converge on Probe and U2R. This is because these two data sets, particularly Probe, contain a large proportion of feature values with frequencies of less than three. As a result, although their overall clustering coefficients are high, their Cheeger constants can be quite small, which leads to slower convergence.

B.2 Sensitivity test w.r.t. the damping factor \(\alpha \)

CBRW has only one parameter, the damping factor \(\alpha \). \(\alpha \) is used to prevent the random walk from getting stuck in isolated nodes by offering a small restart probability \((1-\alpha )\), which guarantees algorithmic convergence without affecting effectiveness. \(\alpha =1.0\) is not recommended as it may break the convergence condition. Also, \(\alpha \) should be sufficiently large, e.g., \(\alpha \ge 0.85\); otherwise the underlying graph structure is largely ignored. Below we examine the sensitivity of CBRW w.r.t. \(\alpha \) over a wide range of values [0.85, 0.99] by performing direct outlier detection (i.e., \(\text {CBRW}_{\mathrm{od}}\) is used). Fig. 5 reports the AUC results w.r.t. \(\alpha \) on all 15 data sets.

Fig. 5 Sensitivity test results w.r.t. the parameter \(\alpha \)

Table 9 Transition matrix used by CBRW for the value graph derived from Table 1
Table 10 Value outlierness yielded by CBRW

The results show that CBRW performs very stably over a large range of tuning options on most of the data sets, and a large \(\alpha \) is preferable to a small one. This is because (i) \(\alpha \) is introduced to guarantee the convergence of the CBRW algorithm and is data-insensitive in terms of effectiveness, unlike some data-sensitive parameters in other detectors, such as the minimum support in FPOF and the subsampling size in iForest; and (ii) the graph structure and edge weights are carefully designed to highlight the outlying values, and a large \(\alpha \) is needed to make use of this graph structure. A large \(\alpha \) is required to achieve the best performance on some data sets, e.g., U2R, APAS, w7a and AD, which may contain highly noisy values; a larger \(\alpha \) widens the gap between the outlierness of the outlying values and that of the noisy values. On the other hand, a medium \(\alpha \) obtains the best performance on other data sets, such as CT. This may be because some outlying values in these data sets cannot attract sufficiently large outlierness through the original graph structure and instead rely on outlierness propagated through restart probabilities. Therefore, we recommend using a relatively large \(\alpha \) (e.g., \(\alpha =0.95\)) to balance both cases.
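To make the role of the restart probability concrete, the following is a minimal PageRank-style damped power iteration sketch; the transition matrix, restart vector and stopping rule are illustrative assumptions, as the exact update used by CBRW is not reproduced in this appendix.

```python
import numpy as np

# Hypothetical row-stochastic biased transition matrix W^b for a 3-node value graph
W_b = np.array([[0.0, 0.7, 0.3],
                [0.4, 0.0, 0.6],
                [0.5, 0.5, 0.0]])

def damped_power_iteration(W, alpha=0.95, tol=1e-9, max_iter=1000):
    """Damped power iteration with a uniform restart vector (an assumption here).

    With probability alpha the walker follows W; with probability (1 - alpha)
    it restarts, which keeps the walk from getting stuck and ensures convergence.
    """
    n = W.shape[0]
    restart = np.full(n, 1.0 / n)
    pi = restart.copy()
    for it in range(1, max_iter + 1):
        pi_next = alpha * (pi @ W) + (1 - alpha) * restart
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next, it
        pi = pi_next
    return pi, max_iter

pi_star, iters = damped_power_iteration(W_b, alpha=0.95)
print(pi_star, "converged in", iters, "iterations")
```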

C Key outputs of CBRW and SDRW

We provide several key outputs of CBRW to enable an in-depth understanding of its algorithmic procedure. All these outputs are built upon the toy dataset in Table 1. In Table 9 we present the transition matrix used by the BRWs in CBRW, i.e., \(\mathbf {W}^b\) in Eq. (8), in which each entry \(\mathbf {W}^b(u, v)\) is determined by the inter-feature outlierness influence \(\eta \) and the intra-feature outlier factor \(\delta \). Given \(\mathbf {W}^b\), the power iteration method is used to perform the random walks and obtain a stationary probability vector \(\varvec{\pi }^{*}\). In CBRW, these stationary probabilities are used as the outlierness of the values in the value-value graph according to Eq. (10). Table 10 shows the outlierness of each value of the toy dataset, with each outlierness corresponding to one entry of \(\varvec{\pi }^{*}\). The outlierness of data objects is then calculated with Eq. (29) based on the value outlierness. Finally, CBRW produces the outlierness of all data objects, as shown in Table 11. The first data object, which is the only genuine outlier, is assigned a larger outlierness than all the other data objects, including the noisy data object #10.

Table 11 Final object outlierness yielded by CBRW

We also provide similar outputs for SDRW. In Table 12 we present the adjacency matrix used by SDRW, i.e., \(\mathbf {C}\) in Eq. (13). Note that the value graph in SDRW is undirected, so we have \(\mathbf {C}(u,v)=\mathbf {C}(v,u)\). We then incorporate the subgraph density-based outlier factor into the matrix and calculate the value outlierness using the closed-form solution in Eq. (21). The resulting value outlierness is shown in Table 13. SDRW finally uses the same object outlierness calculation as CBRW, i.e., Eq. (29), to obtain the object-level outlier scores. As shown in Table 14, SDRW can also easily identify the outliers in the toy dataset. A small end-to-end sketch of this value-to-object scoring pipeline is given after Table 14.

Table 12 Adjacency matrix used by SDRW for the value graph derived from Table 1
Table 13 Value outlierness yielded by SDRW
Table 14 Final object outlierness yielded by SDRW
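The sketch below is a rough illustration of this pipeline, not the exact SDRW formulation (Eqs. (13), (21) and (29) are not reproduced here): a symmetric value adjacency matrix is weighted by a hypothetical per-value outlier factor, each value's outlierness is taken as its degree share \(d^{\prime }(u)/ vol (\mathsf {G}^{\prime })\) following Theorem 2, and an object score is formed by averaging the outlierness of its values, an aggregation rule assumed here purely for illustration.

```python
import numpy as np

# Hypothetical symmetric value adjacency matrix C for 4 feature values
C = np.array([[0., 3., 1., 0.],
              [3., 0., 2., 1.],
              [1., 2., 0., 1.],
              [0., 1., 1., 0.]])

# Hypothetical per-value outlier factor delta
delta = np.array([0.1, 0.2, 0.3, 0.9])

# Outlier-factor-weighted symmetric matrix B'(u,v) = delta(u) C(u,v) delta(v)
B = np.outer(delta, delta) * C

# Closed-form value outlierness (Theorem 2): stationary probability d'(u)/vol(G')
value_outlierness = B.sum(axis=1) / B.sum()

# Hypothetical object-level score: mean outlierness of the values an object takes
objects = [[0, 1, 2],      # object described by values 0, 1, 2
           [0, 1, 3]]      # object containing the rarer value 3
object_scores = [value_outlierness[vals].mean() for vals in objects]
print(value_outlierness, object_scores)
```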

It should be noted that the above results are built upon a simple synthetic toy dataset to demonstrate the procedure of our methods; they do not imply any bias or discrimination issues in applying our methods to real-world datasets.

Cite this article

Pang, G., Cao, L. & Chen, L. Homophily outlier detection in non-IID categorical data. Data Min Knowl Disc 35, 1163–1224 (2021). https://doi.org/10.1007/s10618-021-00750-y
