Lobachevskii Journal of Mathematics, Volume 39, Issue 9, pp 1262–1269

Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark

  • A. Semenov
  • A. Mazeev
  • D. Doropheev
  • T. Yusubaliev
Part 1. Special issue “High Performance Data Intensive Computing” Editors: V. V. Voevodin, A. S. Simonov, and A. V. Lapin

Abstract

The graph anomaly detection problem arises in many application areas. It can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement an algorithm for small community analysis and an approximate Local Outlier Factor (LOF) algorithm based on Locality-Sensitive Hashing, apply both algorithms to a real-world graph, and evaluate their scalability. We use Apache Spark, one of the most popular Big Data frameworks.
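
To make the outlier-spotting step concrete, the following is a minimal sketch of the exact LOF score from Breunig et al., written in plain Python. It is not the authors' implementation: it uses brute-force nearest-neighbour search instead of Locality-Sensitive Hashing, it does not use Spark, and the function names `knn` and `lof_scores` as well as the sample points are hypothetical and chosen only for illustration. It also assumes that each point has exactly k neighbours (ties are broken by index) and that the data contains no duplicate points.

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs.

    Brute force, O(n log n) per query; an LSH-based version would replace this step.
    """
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Compute the LOF score of every point (illustrative, exact, single machine)."""
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from a point to its k-th nearest neighbour.
    k_distance = [nb[-1][0] for nb in neighbours]

    def lrd(i):
        # Reachability distance of i from neighbour j: max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        # Local reachability density: inverse of the mean reachability distance.
        # Assumption: no duplicate points, so the sum is never zero.
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [
        sum(lrds[j] for _, j in neighbours[i]) / (k * lrds[i])
        for i in range(n)
    ]

# Hypothetical data: a tight cluster of four points plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
# The isolated point (10, 10) receives the largest score.
print(scores.index(max(scores)))
```

Scores close to 1 indicate a point whose local density matches that of its neighbours, while scores well above 1 indicate a likely outlier. The neighbour search is the costly part of the computation, which is the step an approximate, hash-based search is meant to speed up.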

Keywords and phrases

Spark, graph processing, unsupervised anomaly detection, performance evaluation

Copyright information

© Pleiades Publishing, Ltd. 2018

Authors and Affiliations

  • A. Semenov (1)
  • A. Mazeev (1)
  • D. Doropheev (2)
  • T. Yusubaliev (3)
  1. Scientific Research Centre for Electronic Computer Technology (NICEVT) JSC, Moscow, Russia
  2. Moscow Institute of Physics and Technology (State University), Dolgoprudny, Moscow oblast, Russia
  3. Quality Software Solutions Ltd., Moscow, Russia