Data Mining and Knowledge Discovery, Volume 31, Issue 2, pp 400–423

Discovering rare categories from graph streams

  • Dawei Zhou
  • Arun Karthikeyan
  • Kangyang Wang
  • Nan Cao
  • Jingrui He

Abstract

Massive graph streams are now produced by many real-world applications, such as financial fraud detection, sensor networks, and wireless networks. Despite the high volume of data, usually only a small percentage of the nodes in these time-evolving graphs are of interest. Rare category detection (RCD) is an important topic in data mining that focuses on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing RCD techniques are designed for static data sets and are therefore not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are built upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models, enabling “time-flexible” RCD. Moreover, to handle cases where the exact priors of the minority classes are not available, we further propose BIRD-LI, a modified version of BIRD. We also identify a critical task in RCD named query distribution, which aims to allocate the limited labeling budget among multiple time steps so that the initial examples from the rare classes are detected as early as possible with minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.
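To make the query distribution task concrete, the following is a minimal sketch of one possible way to split a fixed labeling budget across time steps. It is not one of the strategies evaluated in the paper, which the abstract does not describe. The function name `distribute_queries` and the per-step `scores` (for example, some measure of how much the graph changed at each step) are hypothetical, introduced only for illustration.

```python
def distribute_queries(budget, scores):
    """Split an integer labeling budget across time steps.

    `scores` is a hypothetical list with one non-negative weight per time
    step; a larger weight means that step receives more label queries.
    The allocation is proportional to the weights and always sums to
    `budget` (largest-remainder rounding).
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not scores or any(s < 0 for s in scores):
        raise ValueError("scores must be a non-empty list of non-negative numbers")

    total = sum(scores)
    if total == 0:
        # No signal at any step: fall back to an even split.
        scores = [1.0] * len(scores)
        total = float(len(scores))

    # Ideal (fractional) share of the budget for each time step.
    exact = [budget * s / total for s in scores]
    # Start from the integer part of each share.
    alloc = [int(x) for x in exact]
    leftover = budget - sum(alloc)

    # Give the remaining queries to the steps with the largest fractional
    # parts; ties are broken in favor of earlier time steps.
    order = sorted(range(len(scores)), key=lambda i: (-(exact[i] - alloc[i]), i))
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three time steps, the last one weighted twice as heavily.
    print(distribute_queries(10, [1, 1, 2]))
```

Breaking ties toward earlier time steps is a design choice made here to reflect the stated goal of detecting rare-class examples as early as possible; other tie-breaking rules would be equally valid in this sketch.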

Keywords

Rare category detection, Time-evolving graph, Incremental learning


Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. CIDSE, Arizona State University, Tempe, USA
  2. New York University Shanghai, Shanghai, China
