
RODD: Robust Outlier Detection in Data Cubes

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2023)

Abstract

Data cubes are multidimensional databases, often built from several separate databases, that serve as a flexible basis for data analysis. Surprisingly, outlier detection in data cubes has not yet been treated extensively. In this work, we provide the first framework to evaluate robust outlier detection methods in data cubes (RODD). We introduce a novel random-forest-based outlier detection approach (RODD-RF) and compare it with more traditional methods based on robust location estimators. We propose a general type of test data and examine all methods in a simulation study. Moreover, we apply RODD-RF to real-world data. The results show that RODD-RF leads to improved outlier detection.

L. Kuhlmann and D. Wilmes—These authors contributed equally to this work.
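
Since the abstract only names the random-forest idea at a high level, the following minimal Python sketch illustrates one plausible way to score cube cells with a random-forest regressor and a robust cutoff. The helper flag_outliers, the residual-plus-MAD scoring rule, and the threshold k are illustrative assumptions, not the paper's actual RODD-RF procedure.

    # Hypothetical sketch: random-forest regression residuals with a robust
    # (median/MAD) threshold. Not the authors' RODD-RF algorithm.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def flag_outliers(X, y, k=3.0, random_state=0):
        # Predict the cell measure y from encoded cube dimensions X,
        # then score each cell by how far its out-of-bag residual lies
        # from the residual median, in MAD units.
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   bootstrap=True, random_state=random_state)
        rf.fit(X, y)
        residuals = y - rf.oob_prediction_  # out-of-bag predictions avoid in-sample overfitting

        med = np.median(residuals)
        mad = 1.4826 * np.median(np.abs(residuals - med))  # robust scale estimate
        scores = np.abs(residuals - med) / max(mad, 1e-12)
        return scores > k, scores

With, for example, one-hot encoded dimension levels in X and the aggregated measure in y, cells whose score exceeds k would be reported as candidate outliers; out-of-bag predictions are used here so each cell is scored by trees that did not see it during fitting.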



Acknowledgements

This work was supported by the Research Center Trustworthy Data Science and Security, an institution of the University Alliance Ruhr.

Author information


Correspondence to Lara Kuhlmann or Daniel Wilmes.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kuhlmann, L., Wilmes, D., Müller, E., Pauly, M., Horn, D. (2023). RODD: Robust Outlier Detection in Data Cubes. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_30


  • DOI: https://doi.org/10.1007/978-3-031-39831-5_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39830-8

  • Online ISBN: 978-3-031-39831-5

