Adaptivity in continuous massively parallel distance-based outlier detection

  • Regular Paper
  • Computing

Abstract

We address the problem of dynamically allocating the workload to multiple workers in massively parallel continuous distance-based outlier detection, where the workload is conceptually split into contiguous overlapping regions. The main challenges stem from three facts: modern stream processing frameworks, such as Apache Flink and Spark Streaming, do not support feedback loops; the process is stateful; and the adaptations do not result in key redistribution but in modifying the region boundaries associated with each key. These challenges correspond to overlooked issues, which call for the novel solutions that we provide in this work. More specifically, we first propose an architecture for allowing such adaptations in Flink. Second, we propose specific techniques for adaptive region definition that are applicable to any distance metric. Finally, we conduct a thorough experimental evaluation, and our results show that our proposal is both efficient and effective even in small finite streams. In addition, our proposal is shown to be insensitive to the exact continuous outlier detection algorithm and outlier query parameters.
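
To make the setting concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes one-dimensional numeric data with absolute difference as the distance, the standard distance-based definition of an outlier (fewer than k neighbors within radius R in the current window), and a simple equi-depth rule for moving region boundaries. The function names `target_regions`, `rebalance`, and `outliers` are hypothetical and chosen only for this example.

```python
from bisect import bisect_right
from typing import List


def target_regions(x: float, boundaries: List[float], radius: float) -> List[int]:
    """Regions (workers) that must receive point x.

    Region i owns the half-open interval between boundaries[i-1] and
    boundaries[i]; the first and last regions are unbounded. A point is also
    replicated to every region that may own one of its neighbors, which is
    what makes adjacent regions overlap by `radius`.
    """
    first = bisect_right(boundaries, x - radius)
    last = bisect_right(boundaries, x + radius)
    return list(range(first, last + 1))


def rebalance(sample: List[float], workers: int) -> List[float]:
    """Assumed adaptation rule: equi-depth boundaries from a recent sample.

    Keys (region ids) stay fixed; only the boundaries attached to them move.
    """
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // workers] for i in range(1, workers)]


def outliers(window: List[float], radius: float, k: int) -> List[float]:
    """Points with fewer than k neighbors within `radius` in the window."""
    result = []
    for i, p in enumerate(window):
        neighbors = sum(
            1 for j, q in enumerate(window) if i != j and abs(p - q) <= radius
        )
        if neighbors < k:
            result.append(p)
    return result


if __name__ == "__main__":
    boundaries = [10.0, 20.0]  # three regions: below 10, 10 to 20, 20 and above
    print(target_regions(9.5, boundaries, radius=1.0))   # near a boundary
    print(target_regions(25.0, boundaries, radius=1.0))  # interior point
    skewed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0, 100.0]
    print(rebalance(skewed, workers=2))
    print(outliers([1.0, 1.2, 1.4, 9.0], radius=0.5, k=2))
```

In this sketch a skewed sample pulls the boundary toward the dense part of the value range, so the worker count and region ids stay the same while the load per region evens out. A metric-space variant would need a different notion of region than an interval on the real line.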

Availability of data and material (data transparency)

All datasets used are publicly available from third-party repositories.

Notes

  1. An early short version of this work appeared in [29]; it introduced the technique termed naive in this work, was tailored to the Euclidean space, and was evaluated using only single-dimensional numerical datasets. We significantly extend and improve upon this early work by proposing and experimenting with more eager and sophisticated techniques, while supporting arbitrary metric distances and evaluating on both numeric and text datasets with a high number of dimensions.

  2. The implementation of the whole framework along with the techniques in this work is publicly available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.

  3. https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.

  4. We have experimented with additional parameters (slide sizes from 1% up to 50% of the window size) and the results are similar to those presented; due to space constraints, these experiments are omitted.

  5. To further investigate any possible correlation, we ran tests using the MMPC algorithm [6] after transforming the runtime values into a binary target variable (improvement or no improvement); this also did not yield any concrete results.

References

  1. Abdelhamid AS, Mahmood AR, Daghistani A, Aref WG (2020) Prompt: Dynamic data-partitioning for distributed micro-batch stream processing systems. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp 2455–2469

  2. Aly AM, Mahmood AR, Hassan MS, Aref WG, Ouzzani M, Elmeleegy H, Qadah T (2015) AQWA: adaptive query-workload-aware partitioning of big spatial data. PVLDB 8(13):2062–2073

  3. Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: CIKM, pp 811–820

  4. Balkesen C, Tatbul N (2011) Scalable data partitioning techniques for parallel sliding window processing over data streams. In: International Workshop on Data Management for Sensor Networks (DMSN)

  5. Bellas C, Gounaris A (2020) An empirical evaluation of exact set similarity join techniques using GPUs. Inf Syst 89:101485. https://doi.org/10.1016/j.is.2019.101485

  6. Brown LE, Tsamardinos I, Aliferis CF (2004) A novel algorithm for scalable and accurate Bayesian network learning. In: Fieschi M, Coiera EW, Li JY (eds) MEDINFO 2004 - Proceedings of the 11th World Congress on Medical Informatics, San Francisco, California, USA, September 7-11, 2004, Studies in Health Technology and Informatics, vol 107, pp 711–715

  7. Cao L, Wang J, Rundensteiner EA (2016) Sharing-aware outlier analytics over high-volume data streams. In: ICDM, pp 527–540. ACM

  8. Cao L, Yan Y, Kuhlman C, Wang Q, Rundensteiner EA, Eltabakh MY (2017) Multi-tactic distance-based outlier detection. In: ICDE, pp 959–970

  9. Cao L, Yang D, Wang Q, Yu Y, Wang J, Rundensteiner EA (2014) Scalable distance-based outlier detection over high-volume data streams. In: ICDE, pp 76–87

  10. Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K (2017) State management in Apache Flink®: Consistent stateful distributed stream processing. PVLDB 10(12):1718–1729

  11. Cordova I, Moh T (2015) DBSCAN on resilient distributed datasets. In: 2015 International Conference on High Performance Computing & Simulation, HPCS, pp 531–540

  12. Deshpande A, Ives ZG, Raman V (2007) Adaptive query processing. Found. Trends Databases 1(1):1–140

  13. Ding M, Chen S (2019) Efficient partitioning and query processing of spatio-temporal graphs with trillion edges. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1714–1717. IEEE

  14. Gedik B (2014) Partitioning functions for stateful data parallelism in stream processing. VLDB J 23(4):517–539

  15. Gill G, Dathathri R, Hoang L, Pingali K (2018) A study of partitioning policies for graph analytics on large-scale distributed platforms. Proceedings of the VLDB Endowment 12(4):321–334

  16. Gounaris A, Yfoulis CA, Paton NW (2012) Efficient load balancing in partitioned queries under random perturbations. TAAS 7(1):5:1-5:27

  17. Katsipoulakis NR, Labrinidis A, Chrysanthis PK (2017) A holistic view of stream partitioning costs. PVLDB 10(11):1286–1297

  18. Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: Algorithms and applications. VLDB J 8(3–4):237–253

  19. Kontaki M, Gounaris A, Papadopoulos AN, Tsichlas K, Manolopoulos Y (2016) Efficient and flexible algorithms for monitoring distance-based outliers over data streams. Inf Syst 55:37–53

  20. Monte BD, Zeuch S, Rabl T, Markl V (2020) Rhino: Efficient management of very large distributed state for stream processing engines. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD, pp 2471–2486

  21. Rupprecht L, Culhane W, Pietzuch PR (2017) Squirreljoin: Network-aware distributed join processing with lazy partitioning. PVLDB 10(11):1250–1261

  22. Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2002) Flux: An adaptive partitioning operator for continuous query systems. In: Dayal U, Ramamritham K, Vijayaraman TM (eds) ICDE, pp 25–36

  23. Song H, Lee J (2018) RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD, pp 1173–1187

  24. Su L, Han W, Yang S, Zou P, Jia Y (2007) Continuous adaptive outlier detection on distributed data streams. In: International Conference on High Performance Computing and Communications, pp 74–85

  25. Subramaniam S, Palpanas T, Papadopoulos D, Kalogeraki V, Gunopulos D (2006) Online outlier detection in sensor data using non-parametric models. In: VLDB, pp 187–198

  26. Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568

  27. To Q, Soto J, Markl V (2018) A survey of state management in big data processing systems. VLDB J 27(6):847–872

  28. Toliopoulos T, Bellas C, Gounaris A, Papadopoulos A (2020) PROUD: parallel outlier detection for streams. In: SIGMOD (demo track, to appear)

  29. Toliopoulos T, Gounaris A (2020) Adaptive distributed partitioning in Apache Flink. In: 36th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2020, Dallas, TX, USA, April 20-24, 2020, pp 127–132. IEEE

  30. Toliopoulos T, Gounaris A, Tsichlas K, Papadopoulos A, Sampaio S (2020) Continuous outlier mining of streaming data in Flink. Inf Syst 93:101569

  31. Tran L, Fan L, Shahabi C (2016) Distance-based outlier detection in data streams. PVLDB 9(12):1089–1100

  32. Tran L, Mun M, Shahabi C (2020) Real-time distance-based outlier detection in data streams. PVLDB 14(2):141–153

  33. Yang D, Rundensteiner E, Ward M (2009) Neighbor-based pattern detection for windows over streaming data. In: EDBT, pp 529–540

  34. Yang K, Gao Y, Ma R, Chen L, Wu S, Chen G (2019) DBSCAN-MS: Distributed density-based clustering in metric spaces. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp 1346–1357. IEEE

  35. Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, vol 93, pp 311–321

  36. Yoon S, Lee J, Lee BS (2019) NETS: extremely fast outlier detection from a data stream via set-based processing. PVLDB 12(11):1303–1315

  37. Zhao G, Yu Y, Song P, Zhao G, Ji Z (2018) A parameter space framework for online outlier detection over high-volume data streams. IEEE Access 6:38124–38136

Acknowledgements

This research work has been supported by the European Commission under the Horizon 2020 Programme, through funding of the LifeChamps project (Grant 875329).

Funding

European Commission under the Horizon 2020 Programme, LifeChamps project (Grant 875329).

Author information

Corresponding author

Correspondence to Theodoros Toliopoulos.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Code availability (software application or custom code)

All code is available from https://github.com/tatoliop/PROUD-PaRallel-OUtlier-Detection-for-streams/tree/adaptive_partitioning.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Toliopoulos, T., Gounaris, A. Adaptivity in continuous massively parallel distance-based outlier detection. Computing 104, 2659–2684 (2022). https://doi.org/10.1007/s00607-022-01101-5

  • DOI: https://doi.org/10.1007/s00607-022-01101-5
