
RODD: Robust Outlier Detection in Data Cubes

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2023)

Abstract

Data cubes are multidimensional databases, often built from several separate databases, that serve as a flexible basis for data analysis. Surprisingly, outlier detection in data cubes has not yet been treated extensively. In this work, we provide the first framework to evaluate robust outlier detection methods in data cubes (RODD). We introduce a novel random-forest-based outlier detection approach (RODD-RF) and compare it with more traditional methods based on robust location estimators. We propose a general type of test data and examine all methods in a simulation study. Moreover, we apply RODD-RF to real-world data. The results show that RODD-RF leads to improved outlier detection.

L. Kuhlmann and D. Wilmes—These authors contributed equally to this work.
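
Since the abstract only names the random-forest idea at a high level, the following minimal Python sketch illustrates one plausible way to score cube cells with a random-forest regressor and a robust cutoff. The helper flag_outliers, the residual-plus-MAD scoring rule, and the threshold k are illustrative assumptions, not the paper's actual RODD-RF procedure.

    # Hypothetical sketch: random-forest regression residuals with a robust
    # (median/MAD) threshold. Not the authors' RODD-RF algorithm.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def flag_outliers(X, y, k=3.0, random_state=0):
        # Predict the cell measure y from encoded cube dimensions X,
        # then score each cell by how far its out-of-bag residual lies
        # from the residual median, in MAD units.
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   bootstrap=True, random_state=random_state)
        rf.fit(X, y)
        residuals = y - rf.oob_prediction_  # out-of-bag predictions avoid in-sample overfitting

        med = np.median(residuals)
        mad = 1.4826 * np.median(np.abs(residuals - med))  # robust scale estimate
        scores = np.abs(residuals - med) / max(mad, 1e-12)
        return scores > k, scores

With, for example, one-hot encoded dimension levels in X and the aggregated measure in y, cells whose score exceeds k would be reported as candidate outliers; out-of-bag predictions are used here so each cell is scored by trees that did not see it during fitting.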



Acknowledgements

This work was supported by the Research Center Trustworthy Data Science and Security, an institution of the University Alliance Ruhr.

Author information


Correspondence to Lara Kuhlmann or Daniel Wilmes.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kuhlmann, L., Wilmes, D., Müller, E., Pauly, M., Horn, D. (2023). RODD: Robust Outlier Detection in Data Cubes. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_30


  • DOI: https://doi.org/10.1007/978-3-031-39831-5_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39830-8

  • Online ISBN: 978-3-031-39831-5

