Detecting anomaly collections using extreme feature ranks

Dai, Hanbo; Zhu, Feida; Lim, Ee-Peng; Pang, HweeHwa

doi:10.1007/s10618-014-0360-3

Published: 15 July 2014

Data Mining and Knowledge Discovery, Volume 29, pages 689–731 (2015)


Abstract

Detecting anomaly collections is an important task with many applications, including spam and fraud detection. In an anomaly collection, entities often operate in collusion and hold agendas different from those of normal entities. As a result, they usually manifest collective extreme traits, i.e., members of an anomaly collection are consistently clustered toward the top or bottom ranks on certain features. We therefore propose to detect these anomaly collections by extreme feature ranks. We introduce a novel anomaly definition called Extreme Rank Anomalous Collection, or ERAC, and propose a new measure of anomalousness that captures collective extreme traits based on a statistical model. As there can be a large number of ERACs of various sizes, for simplicity we first investigate the ERAC detection problem of finding the top-\(K\) ERACs within a predefined size limit. We then tackle the follow-up ERAC expansion problem of uncovering supersets of the detected ERACs that are more anomalous, without any size constraint. Algorithms are proposed for both the ERAC detection and expansion problems, followed by studies of their performance on four datasets. Specifically, on synthetic datasets, both the ERAC detection and expansion algorithms demonstrate high precision and recall. On a web spam dataset, both algorithms discover web spammers with higher precision than existing approaches. On an IMDB dataset, both algorithms identify unusual actor collections that are not easily identified by clustering-based methods. On a Chinese online forum dataset, our ERAC detection algorithm identifies “water army” spammer collections that human evaluators agreed were suspicious, and the ERAC expansion algorithm successfully reveals two larger spammer collections with different spamming behaviors.
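To make the idea of collective extreme traits concrete, the following minimal Python sketch counts how many members of a candidate collection fall among the top-ranked entities on one feature, and then asks how surprising that count would be if the collection were a random subset. The abstract does not specify the statistical model, so the hypergeometric null used here is an assumption for illustration, not the authors' measure; the function names `count_top_ranked` and `top_rank_tail_pvalue` are hypothetical.

```python
from math import comb


def count_top_ranked(values, collection, top_k):
    """Count members of `collection` among the `top_k` entities with the
    largest feature value. `values` maps entity -> feature value."""
    ranked = sorted(values, key=values.get, reverse=True)
    top = set(ranked[:top_k])
    return sum(1 for entity in collection if entity in top)


def top_rank_tail_pvalue(n_entities, top_k, collection_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(n_entities, top_k, collection_size).

    Assumed null model (not taken from the paper): the collection is a
    uniformly random subset of the entities, so the number of members that
    land in the top_k ranks is hypergeometric.
    """
    upper = min(top_k, collection_size)
    total = comb(n_entities, collection_size)
    tail = sum(
        comb(top_k, x) * comb(n_entities - top_k, collection_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


# Toy data: five entities, one feature, one candidate collection.
feature = {"a": 9.0, "b": 8.0, "c": 7.0, "d": 1.0, "e": 0.0}
candidate = ["a", "b", "e"]
hits = count_top_ranked(feature, candidate, top_k=3)
p_value = top_rank_tail_pvalue(len(feature), 3, len(candidate), hits)
```

A small tail probability suggests that the collection is concentrated in the extreme ranks more than chance would explain. The same computation on ascending ranks would cover the bottom-rank case.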


Notes

To control the type I error (false positives), the significance level of each individual test should be adjusted. We adopt the Bonferroni correction by Dunnett (1955) and adjust the significance level to \(\alpha /|F|\).
http://barcelona.research.yahoo.net/webspam/datasets.
http://rdf.dmoz.org/.
www.cs.waikato.ac.nz/ml/weka.
http://www.imdb.com/interfaces.
http://en.wikipedia.org/wiki/Internet_Water_Army.
www.360.cn, the number 1 computer network security service provider in China.
www.QQ.com, the number 1 instant messaging and online community service provider in China.
http://www.chinadaily.com.cn/bizchina/2010-11/05/content_11509557.htm.
http://www.chinadaily.com.cn/china/2010-10/21/content_11437735.htm.
http://www.tianya.cn.
http://www.shuijunshiwan.com/wenku/.
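The significance-level adjustment in the first note above, \(\alpha /|F|\), can be sketched in a few lines of Python. This is only an illustration of the Bonferroni adjustment over a feature set \(F\); the function names and the example p-values are hypothetical and are not taken from the article.

```python
def bonferroni_level(alpha, num_features):
    """Per-test significance level alpha / |F| for |F| = num_features tests."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    return alpha / num_features


def significant_features(p_values, alpha=0.05):
    """Return the features whose p-value passes the adjusted level.

    `p_values` maps feature name -> p-value of that feature's test.
    """
    level = bonferroni_level(alpha, len(p_values))
    return [name for name, p in p_values.items() if p <= level]


# Hypothetical p-values for three features; the adjusted level is 0.05 / 3.
example = {"f1": 0.001, "f2": 0.02, "f3": 0.5}
selected = significant_features(example, alpha=0.05)
```

With three features, only tests with a p-value at or below 0.05 / 3 are kept, so `f2` at 0.02 no longer counts as significant even though it would pass an unadjusted 0.05 threshold.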

References

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Very Large Data Bases Conference (VLDB), pp 487–499
Arias-Castro E, Candes EJ, Durand A (2011) Detection of an anomalous cluster in a network. Ann Stat 39(1):278–304
Avis D, Fukuda K (1993) Reverse search for enumeration. Discret Appl Math 65:21–46
Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
Becchetti L, Castillo C, Donato D, Baeza-Yates R, Leonardi S (2008) Link analysis for web spam detection. ACM Trans Web 2(1):2:1–2:42
Bomze IM, Budinich M, Pardalos PM, Pelillo M (1999) The maximum clique problem. Kluwer Academic Publishers, Norwell
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM SIGMOD Record, pp 93–104
Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: web spam detection using the web topology. In: International Conference on Research and Development in Information Retrieval (SIGIR), pp 423–430
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58
Chua CEH, Wareham J (2004) Fighting internet auction fraud: an assessment and proposal. Computer 37(10):31–37
Dai H, Zhu F, Lim EP, Pang H (2012) Detecting extreme rank anomalous collections. In: SIAM International Conference on Data Mining (SDM), pp 883–894
Duan L, Xu L, Liu Y, Lee J (2009) Cluster-based outlier detection. Ann Oper Res 168(1):151–168
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121
Eppstein D (2005) All maximal independent sets and dynamic dominance for sparse graphs. In: Symposium on Discrete Algorithms (SODA), pp 451–459
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining (KDD), pp 226–231
Fetterly D, Manasse M, Najork M (2004) Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of the international workshop on the web and databases, pp 1–6
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: International Conference on Data Engineering (ICDE), pp 512–521
Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Very Large Data Bases Conference (VLDB), pp 576–587
Gyöngyi Z, Berkhin P, Garcia-Molina H, Pedersen J (2006) Link spam detection based on mass estimation. In: Very Large Data Bases Conference (VLDB), pp 439–450
Harkness WL (1965) Properties of the extended hypergeometric distribution. Ann Math Stat 36(3):938–945
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9):1641–1650
Herrera F, Carmona CJ, Gonzalez P, del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525
Kendall M (1948) Rank correlation methods. Griffin
Klösgen W (1996) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence
Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Very Large Data Bases Conference (VLDB), pp 392–403
Lawler EL, Lenstra JK, Kan AHGR (1980) Generating all maximal independent sets: NP-hardness and polynomial-time algorithms. SIAM J Comput 9(3):558–565
Liu FT, Ting KM, Zhou ZH (2010) On detecting clustered anomalies using SCiForest. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 274–290
Loureiro A, Torgo L, Soares C (2004) Outlier detection using clustering methods: a data cleaning application. In: Proceedings of the data mining for business workshop
Pandit S, Chau DH, Wang S, Faloutsos C (2007) NetProbe: a fast and scalable system for fraud detection in online auction networks. In: International World Wide Web Conference (WWW), pp 201–210
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 78–87
Wu T (1993) An accurate computation of the hypergeometric distribution function. ACM Trans Math Softw 19(1):33–43

Acknowledgments

Part of this work was done while the first author was pursuing a PhD at the School of Information Systems, Singapore Management University, Singapore. This work is partly supported by the National Natural Science Foundation of China (NSFC Grant No. 61300126).

Author information

Authors and Affiliations

The School of Computer Science and Information Engineering, Hubei University, Wuhan, China
Hanbo Dai
The School of Information Systems, Singapore Management University, Singapore, Singapore
Feida Zhu, Ee-Peng Lim & HweeHwa Pang

Corresponding author

Correspondence to Hanbo Dai.

Additional information

Responsible editor: Johannes Fürnkranz.

About this article

Cite this article

Dai, H., Zhu, F., Lim, EP. et al. Detecting anomaly collections using extreme feature ranks. Data Min Knowl Disc 29, 689–731 (2015). https://doi.org/10.1007/s10618-014-0360-3

Received: 24 February 2013
Accepted: 04 June 2014
Published: 15 July 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s10618-014-0360-3
