Detecting anomaly collections using extreme feature ranks

Data Mining and Knowledge Discovery

Abstract

Detecting anomaly collections is an important task with many applications, including spam and fraud detection. In an anomaly collection, entities often operate in collusion and pursue agendas different from those of normal entities. As a result, they usually manifest collective extreme traits, i.e., members of an anomaly collection are consistently clustered toward the top or bottom ranks on certain features. We therefore propose to detect these anomaly collections by their extreme feature ranks. We introduce a novel anomaly definition called Extreme Rank Anomalous Collection (ERAC), together with a new measure of anomalousness, based on a statistical model, that captures collective extreme traits. Because there can be a large number of ERACs of various sizes, we first investigate the simpler ERAC detection problem of finding the top-\(K\) ERACs within a predefined size limit. We then tackle the follow-up ERAC expansion problem of uncovering supersets of the detected ERACs that are more anomalous, without any size constraint. We propose algorithms for both the ERAC detection and expansion problems and study their performance on four datasets. On synthetic datasets, both algorithms achieve high precision and recall. On a web spam dataset, both discover web spammers with higher precision than existing approaches. On an IMDB dataset, both identify unusual actor collections that are not easily identified by clustering-based methods. On a Chinese online forum dataset, the ERAC detection algorithm identifies “water army” spammer collections that human evaluators agree are suspicious, and the ERAC expansion algorithm reveals two larger spammer collections with different spamming behaviors.
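
To make the idea of "collective extreme traits" concrete, the following minimal sketch scores how surprising it is for a candidate collection to sit in the top ranks of a single feature. It is an illustration under stated assumptions, not the ERAC measure itself, which the abstract does not spell out: it assumes a hypergeometric null model (collection members are a random draw from all entities), and the function names, the `top_size` parameter, and the toy data are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, extreme, drawn, hits):
    """P(X >= hits) when `drawn` entities are sampled without replacement
    from `population` entities, `extreme` of which lie in the extreme region."""
    total = comb(population, drawn)
    favourable = sum(
        comb(extreme, i) * comb(population - extreme, drawn - i)
        for i in range(hits, min(extreme, drawn) + 1)
    )
    return favourable / total


def top_rank_pvalue(values, collection, top_size):
    """Assumed null model: the collection is a random subset of all entities.

    values     -- dict mapping entity id to its value on one feature
    collection -- iterable of entity ids forming the candidate collection
    top_size   -- number of top-ranked entities treated as the extreme region
    """
    members = set(collection)
    ranked = sorted(values, key=values.get, reverse=True)
    top_region = set(ranked[:top_size])
    hits = len(top_region & members)
    return hypergeom_upper_tail(len(values), top_size, len(members), hits)


# Toy data (hypothetical): ten entities whose feature value equals their index.
scores = {f"e{i}": float(i) for i in range(10)}
suspects = {"e9", "e8", "e7"}
p = top_rank_pvalue(scores, suspects, top_size=3)
print(p)
```

In this toy case all three suspects occupy the top three of ten ranks, so the tail probability is 1/120. Bottom ranks can be handled the same way by sorting in ascending order.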

Notes

  1. To control the Type I error (false positive) rate across multiple tests, the significance level for each individual test should be adjusted. We adopt the Bonferroni correction (Dunnett 1955) to adjust the significance level to \(\alpha /|F|\).

  2. http://barcelona.research.yahoo.net/webspam/datasets.

  3. http://rdf.dmoz.org/.

  4. www.cs.waikato.ac.nz/ml/weka.

  5. http://www.imdb.com/interfaces.

  6. http://en.wikipedia.org/wiki/Internet_Water_Army.

  7. www.360.cn, the number 1 computer network security service provider in China.

  8. www.QQ.com, the number 1 instant messaging and online community service provider in China.

  9. http://www.chinadaily.com.cn/bizchina/2010-11/05/content_11509557.htm.

  10. http://www.chinadaily.com.cn/china/2010-10/21/content_11437735.htm.

  11. http://www.tianya.cn.

  12. http://www.shuijunshiwan.com/wenku/.
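
The adjustment in note 1 can be sketched as follows. This is a minimal illustration, assuming one p-value per feature and that \(|F|\) is the number of features tested. The function name, the default \(\alpha = 0.05\), and the sample p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the features whose p-value passes the Bonferroni-adjusted
    threshold alpha / |F|, where |F| is the number of features tested.

    p_values -- dict mapping feature name to its (unadjusted) p-value
    """
    if not p_values:
        return set()
    threshold = alpha / len(p_values)
    return {feature for feature, p in p_values.items() if p <= threshold}


# Hypothetical p-values for three features; the threshold is 0.05 / 3.
example = {"f1": 0.001, "f2": 0.02, "f3": 0.2}
significant = bonferroni_significant(example)
print(sorted(significant))
```

Here only `f1` falls below the adjusted threshold. `f2` would pass an unadjusted test at 0.05 but not the corrected one.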

References

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Very Large Data Bases Conference (VLDB), pp 487–499

  • Arias-Castro E, Candes EJ, Durand A (2011) Detection of an anomalous cluster in a network. Ann Stat 39(1):278–304

  • Avis D, Fukuda K (1993) Reverse search for enumeration. Discret Appl Math 65:21–46

  • Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York

  • Becchetti L, Castillo C, Donato D, Baeza-Yates R, Leonardi S (2008) Link analysis for web spam detection. ACM Trans Web 2(1):2:1–2:42

  • Bomze IM, Budinich M, Pardalos PM, Pelillo M (1999) The maximum clique problem. Kluwer Academic Publishers, Norwell

  • Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM SIGMOD Record, pp 93–104

  • Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: web spam detection using the web topology. In: International Conference on Research and Development in Information Retrieval (SIGIR), pp 423–430

  • Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58

  • Chua CEH, Wareham J (2004) Fighting internet auction fraud: an assessment and proposal. Computer 37(10):31–37

  • Dai H, Zhu F, Lim EP, Pang H (2012) Detecting extreme rank anomalous collections. In: SIAM International Conference on Data Mining (SDM), pp 883–894

  • Duan L, Xu L, Liu Y, Lee J (2009) Cluster-based outlier detection. Ann Oper Res 168(1):151–168

  • Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 50(272):1096–1121

  • Eppstein D (2005) All maximal independent sets and dynamic dominance for sparse graphs. In: Symposium on Discrete Algorithms (SODA), pp 451–459

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining (KDD), pp 226–231

  • Fetterly D, Manasse M, Najork M (2004) Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of the international workshop on the web and databases, pp 1–6

  • Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: International Conference on Data Engineering (ICDE), pp 512–521

  • Gyöngyi Z, Garcia-Molina H, Pedersen J (2004) Combating web spam with TrustRank. In: Very Large Data Bases Conference (VLDB), pp 576–587

  • Gyöngyi Z, Berkhin P, Garcia-Molina H, Pedersen J (2006) Link spam detection based on mass estimation. In: Very Large Data Bases Conference (VLDB), pp 439–450

  • Harkness WL (1965) Properties of the extended hypergeometric distribution. Ann Math Stat 36(3):938–945

  • He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9):1641–1650

  • Herrera F, Carmona CJ, Gonzalez P, del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525

  • Kendall M (1948) Rank correlation methods. Griffin

  • Klösgen W (1996) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence

  • Knorr EM, Ng RT (1998) Algorithms for mining distance-based outliers in large datasets. In: Very Large Data Bases Conference (VLDB), pp 392–403

  • Lawler EL, Lenstra JK, Kan AHGR (1980) Generating all maximal independent sets: NP-hardness and polynomial-time algorithms. SIAM J Comput 9(3):558–565

  • Liu FT, Ting KM, Zhou ZH (2010) On detecting clustered anomalies using SCiForest. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 274–290

  • Loureiro A, Torgo L, Soares C (2004) Outlier detection using clustering methods: a data cleaning application. In: Proceedings of the data mining for business workshop

  • Pandit S, Chau DH, Wang S, Faloutsos C (2007) NetProbe: a fast and scalable system for fraud detection in online auction networks. In: International World Wide Web Conference (WWW), pp 201–210

  • Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp 78–87

  • Wu T (1993) An accurate computation of the hypergeometric distribution function. ACM Trans Math Softw 19(1):33–43

Acknowledgments

Part of this work was done while the first author was pursuing a PhD at the School of Information Systems, Singapore Management University, Singapore. This work is partly supported by the National Natural Science Foundation of China (NSFC Grant No. 61300126).

Author information

Corresponding author

Correspondence to Hanbo Dai.

Additional information

Responsible editor: Johannes Fürnkranz.

Cite this article

Dai, H., Zhu, F., Lim, EP. et al. Detecting anomaly collections using extreme feature ranks. Data Min Knowl Disc 29, 689–731 (2015). https://doi.org/10.1007/s10618-014-0360-3

  • DOI: https://doi.org/10.1007/s10618-014-0360-3
