Abstract
We propose a novel method for outlier detection using binary decision diagrams. We introduce leave-one-out density as a new measure for detecting outliers: the ratio of the number of data elements inside a region to the volume of that region, evaluated after the datum under consideration is removed. We show that, by using binary decision diagrams, leave-one-out density can be evaluated very efficiently on a set of regions around each datum in a given dataset. The time complexity of the proposed method is nearly linear in the size of the dataset, while its outlier detection accuracy remains comparable to that of other methods. Experimental results demonstrate the effectiveness of the proposed method.
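The count-to-volume ratio described above can be illustrated with a brute-force sketch. It does not reproduce the paper's method: the region here is assumed to be a fixed axis-aligned hypercube centered on the datum, no binary decision diagrams are used, and the names `leave_one_out_density` and `half_width` are hypothetical.

```python
def leave_one_out_density(data, index, half_width):
    """Count-to-volume ratio in a hypercube around data[index], excluding that datum.

    Assumption: the region is a hypercube of side 2 * half_width centered on
    the focal datum. The regions used in the paper are not reproduced here.
    """
    center = data[index]
    dim = len(center)
    # Leave-one-out step: drop the focal datum before counting.
    others = data[:index] + data[index + 1:]
    inside = sum(
        all(abs(p[k] - center[k]) <= half_width for k in range(dim))
        for p in others
    )
    volume = (2 * half_width) ** dim
    return inside / volume


# Hypothetical toy data: three clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = [leave_one_out_density(points, i, 0.5) for i in range(len(points))]
```

In this toy setting the isolated point has no neighbors left once it is removed, so its density is zero, while each clustered point still has two neighbors inside its region. A low score therefore flags a candidate outlier.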
Notes
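Smoothing a Boolean function \(f\) with respect to a variable \(x\) means existentially quantifying \(x\), that is, taking \(f|_{x=1} \vee f|_{x=0}\), the disjunction of the two cofactors of \(f\) with respect to \(x\).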
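As a small illustration of smoothing in its standard sense (existential quantification: the OR of the function with the variable fixed to true and to false), the sketch below uses plain Python callables rather than a BDD package. The helper `smooth` and the example function are hypothetical names introduced only for this illustration.

```python
from itertools import product


def smooth(f, var):
    """Return f smoothed with respect to var (existential quantification)."""
    def smoothed(assignment):
        # OR of the two cofactors: var forced to True, then var forced to False.
        return bool(
            f({**assignment, var: True}) or f({**assignment, var: False})
        )
    return smoothed


# Hypothetical example: f(x, y) = x AND y.
def f(a):
    return a["x"] and a["y"]


g = smooth(f, "x")

# Truth table of the smoothed function over all assignments of x and y.
table = {
    (x, y): g({"x": x, "y": y})
    for x, y in product([False, True], repeat=2)
}
```

After smoothing, the result no longer depends on `x`: for this example, `g` agrees with `y` on every assignment.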
Additional information
Responsible editor: Srinivasan Parthasarathy
About this article
Cite this article
Kutsuna, T., Yamamoto, A. Outlier detection using binary decision diagrams. Data Min Knowl Disc 31, 548–572 (2017). https://doi.org/10.1007/s10618-016-0486-6
DOI: https://doi.org/10.1007/s10618-016-0486-6