Outlier Detection Forest for Large-Scale Categorical Data Sets
Outlier detection is one of the most important data mining problems and has attracted much attention over the past years. A variety of outlier detection schemes have been proposed, but most of them work only with numeric data sets. These methods cannot be applied directly to categorical data sets, because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing schemes that are tailored to categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF achieves the same level of detection precision with much better scalability.
Keywords: Categorical data, Outlier detection, Big data, Entropy
This work was supported by the National Natural Science Foundation of China (No. 61772154).