Knowledge and Information Systems, Volume 35, Issue 3, pp 713–736

CLOVER: a faster prior-free approach to rare-category detection

  • Hao Huang
  • Qinming He
  • Kevin Chiew
  • Feng Qian
  • Lianhang Ma
Regular Paper

Abstract

Rare-category detection helps discover new rare classes in an unlabeled data set by selecting candidate data examples of those classes for labeling. Most existing approaches to rare-category detection require prior information about the data set and are not applicable without it. Prior-free algorithms address this problem without such information, but at the cost of high time complexity, which is no lower than \(O(dN^2)\), where \(N\) is the number of data examples in the data set and \(d\) is its dimension. In this paper, we propose CLOVER, a prior-free algorithm that introduces a novel rare-category criterion known as local variation degree (LVD). LVD uses the characteristics of rare classes to distinguish rare-class data examples from other types of data examples, and the data examples with maximum LVD values are passed to CLOVER for labeling. A notable improvement is that CLOVER’s time complexity is \(O(dN^{2-1/d})\) for \(d > 1\) or \(O(N\log N)\) for \(d = 1\). Extensive experimental results on real data sets demonstrate the effectiveness and efficiency of our method in discovering new rare classes with lower time complexity.
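To give a rough sense of the gap between the two bounds quoted above, the ratio of \(dN^2\) to \(dN^{2-1/d}\) is \(N^{1/d}\). The short sketch below evaluates that ratio for a few dimensions. It is an illustration only: the function name `bound_ratio` and the sample values of \(N\) and \(d\) are assumptions made here, and constant factors hidden by the \(O(\cdot)\) notation are ignored, so the numbers do not predict measured running times.

```python
def bound_ratio(n: int, d: int) -> float:
    """Ratio of the d*N^2 bound to the d*N^(2 - 1/d) bound, which is N^(1/d).

    Hypothetical helper for illustration; it compares the asymptotic
    expressions only and ignores any constant factors.
    """
    if n < 1:
        raise ValueError("n must be a positive number of data examples")
    if d <= 1:
        # The abstract states the N^(2 - 1/d) bound for d > 1 only;
        # for d = 1 it gives O(N log N) instead.
        raise ValueError("the N^(2 - 1/d) bound is stated for d > 1")
    return n ** (1.0 / d)


if __name__ == "__main__":
    n = 10**6  # assumed data set size, chosen only for illustration
    for d in (2, 3, 10, 50):
        print(f"d={d:>2}: ratio of bounds = {bound_ratio(n, d):.2f}")
```

Because the ratio is \(N^{1/d}\), the asymptotic advantage over the \(O(dN^2)\) bound is largest in low dimensions and shrinks as \(d\) grows.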

Keywords

  • Rare-category detection
  • Local variation degree
  • \(k\)NN
  • M\(k\)NN
  • Histogram density estimation

Copyright information

© Springer-Verlag London Limited 2012

Authors and Affiliations

  • Hao Huang (1)
  • Qinming He (1)
  • Kevin Chiew (2)
  • Feng Qian (1)
  • Lianhang Ma (1)

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou, People’s Republic of China
  2. School of Engineering, Tan Tao University, Duc Hoa District, Vietnam
