Abstract
Detecting anomaly collections is an important task with many applications, including spam and fraud detection. In an anomaly collection, entities often operate in collusion and hold agendas different from those of normal entities. As a result, they usually manifest collective extreme traits, i.e., members of an anomaly collection are consistently clustered toward the top or bottom ranks on certain features. We therefore propose to detect these anomaly collections by extreme feature ranks. We introduce a novel anomaly definition called Extreme Rank Anomalous Collection, or ERAC, and propose a new measure of anomalousness, based on a statistical model, that captures collective extreme traits. As there can be a large number of ERACs of various sizes, for simplicity, we first investigate the ERAC detection problem of finding the top-\(K\) ERACs within a predefined size limit. We then tackle the follow-up ERAC expansion problem of uncovering supersets of the detected ERACs that are more anomalous, without any size constraint. We propose algorithms for both the ERAC detection and expansion problems and study their performance on four datasets. On synthetic datasets, both algorithms achieve high precision and recall. On a web spam dataset, both algorithms discover web spammers with higher precision than existing approaches. On an IMDB dataset, both algorithms identify unusual actor collections that are not easily identified by clustering-based methods. On a Chinese online forum dataset, the ERAC detection algorithm identifies suspicious “water army” spammer collections that human evaluators confirm, and the ERAC expansion algorithm reveals two larger spammer collections with different spamming behaviors.
Notes
To control the type I error (false positive) rate, the significance level for each individual test should be adjusted. We adopt the Bonferroni correction (Dunnett 1955) to adjust the significance level to \(\alpha /|F|\).
www.360.cn, the number 1 computer network security service provider in China.
www.QQ.com, the number 1 instant messaging and online community service provider in China.
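The adjustment described in the first note can be illustrated with a minimal Python sketch. It assumes that \(F\) denotes the set of features being tested and that one p-value is available per feature; the function names, feature names, and p-values below are hypothetical and do not come from the article.

```python
from typing import Dict, List


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Return the per-test significance level alpha / |F|."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return alpha / num_tests


def significant_features(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return features whose p-value passes the Bonferroni-adjusted level.

    Assumption: p_values holds one p-value per feature in F.
    """
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [name for name, p in p_values.items() if p <= threshold]


# Hypothetical p-values for four illustrative features (|F| = 4).
example_p_values = {
    "in_degree": 0.001,
    "out_degree": 0.02,
    "page_rank": 0.2,
    "trust_rank": 0.012,
}

adjusted = bonferroni_alpha(0.05, len(example_p_values))
selected = significant_features(example_p_values, alpha=0.05)
print(adjusted, selected)
```

With four tests and \(\alpha = 0.05\), each individual test is held to a level of 0.0125, so a feature with a p-value of 0.02 would pass an unadjusted test but not the adjusted one.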
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Very Large Data Bases Conference (VLDB), pp 487–499
Arias-Castro E, Candes EJ, Durand A (2011) Detection of an anomalous cluster in a network. Ann Stat 39(1):278–304
Avis D, Fukuda K (1993) Reverse search for enumeration. Discret Appl Math 65:21–46
Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
Becchetti L, Castillo C, Donato D, Baeza-Yates R, Leonardi S (2008) Link analysis for web spam detection. ACM Trans Web 2(1):2:1–2:42
Bomze IM, Budinich M, Pardalos PM, Pelillo M (1999) The maximum clique problem. Kluwer Academic Publishers, Norwell
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM SIGMOD Record, pp 93–104
Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: web spam detection using the web topology. In: International Conference on Research and Development in Information Retrieval (SIGIR), pp 423–430
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58
Chua CEH, Wareham J (2004) Fighting internet auction fraud: an assessment and proposal. Computer 37(10):31–37
Dai H, Zhu F, Lim EP, Pang H (2012) Detecting extreme rank anomalous collections. In: SIAM International Conference on Data Mining (SDM), pp 883–894
Duan L, Xu L, Liu Y, Lee J (2009) Cluster-based outlier detection. Ann Oper Res 168(1):151–168
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121
Eppstein D (2005) All maximal independent sets and dynamic dominance for sparse graphs. In: Symposium on Discrete Algorithms (SODA), pp 451–459
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining (KDD), pp 226–231
Fetterly D, Manasse M, Najork M (2004) Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of the international workshop on the web and databases, pp 1–6
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: International Conference on Data Engineering (ICDE), pp 512–521
Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Very Large Data Bases Conference (VLDB), pp 576–587
Gyöngyi Z, Berkhin P, Garcia-Molina H, Pedersen J (2006) Link spam detection based on mass estimation. In: Very Large Data Bases Conference (VLDB), pp 439–450
Harkness WL (1965) Properties of the extended hypergeometric distribution. Ann Math Stat 36(3):938–945
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9):1641–1650
Herrera F, Carmona CJ, Gonzalez P, del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525
Kendall M (1948) Rank correlation methods. Griffin
Klösgen W (1996) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence
Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Very Large Data Bases Conference (VLDB), pp 392–403
Lawler EL, Lenstra JK, Rinnooy Kan AHG (1980) Generating all maximal independent sets: NP-hardness and polynomial-time algorithms. SIAM J Comput 9(3):558–565
Liu FT, Ting KM, Zhou ZH (2010) On detecting clustered anomalies using SCiForest. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 274–290
Loureiro A, Torgo L, Soares C (2004) Outlier detection using clustering methods: a data cleaning application. In: Proceedings of the data mining for business workshop
Pandit S, Chau DH, Wang S, Faloutsos C (2007) NetProbe: a fast and scalable system for fraud detection in online auction networks. In: International World Wide Web Conference (WWW), pp 201–210
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 78–87
Wu T (1993) An accurate computation of the hypergeometric distribution function. ACM Trans Math Softw 19(1):33–43
Acknowledgments
Part of this work was done while the first author was pursuing a PhD at the School of Information Systems, Singapore Management University, Singapore. This work is partly supported by the National Natural Science Foundation of China (NSFC Grant No. 61300126).
Additional information
Responsible editor: Johannes Fürnkranz.
About this article
Cite this article
Dai, H., Zhu, F., Lim, EP. et al. Detecting anomaly collections using extreme feature ranks. Data Min Knowl Disc 29, 689–731 (2015). https://doi.org/10.1007/s10618-014-0360-3
DOI: https://doi.org/10.1007/s10618-014-0360-3