Outlier Detection Forest for Large-Scale Categorical Data Sets

  • Zhipeng Sun
  • Hongwei Du
  • Qiang Ye
  • Chuang Liu
  • Patricia Lilian Kibenge
  • Hui Huang
  • Yuying Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11917)


Outlier detection is one of the most important problems in data mining and has attracted much attention over the past years. A variety of outlier detection schemes have been proposed, but most of them work with numeric data sets. These methods cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes tailored for categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF achieves the same level of outlier detection precision with much better scalability.


Keywords: Categorical data, Outlier detection, Big data, Entropy
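
The keywords list entropy alongside categorical data, but the abstract does not say how ODF uses it. The following is therefore only a minimal sketch of Shannon entropy computed over a single categorical attribute, not the authors' algorithm. The function name shannon_entropy and the toy records are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of a sequence of categorical values."""
    counts = Counter(values)
    total = len(values)
    # H = sum over categories of p * log2(1 / p), with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical toy records: each column is one categorical attribute.
columns = {
    "color": ["red", "red", "blue", "blue"],
    "shape": ["circle", "circle", "circle", "circle"],
}

for name, values in columns.items():
    print(name, shannon_entropy(values))
```

An attribute whose values are all identical has zero entropy, while an attribute whose values are spread evenly over many categories has high entropy. No distance or similarity measure between categories is needed to compute it.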



This work was supported by National Natural Science Foundation of China No. 61772154.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Zhipeng Sun (1)
  • Hongwei Du (1)
  • Qiang Ye (2)
  • Chuang Liu (1)
  • Patricia Lilian Kibenge (2)
  • Hui Huang (2)
  • Yuying Li (3)

  1. Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
  2. Faculty of Computer Science, Dalhousie University, Halifax, Canada
  3. Department of Economics and Management, Harbin Institute of Technology (Shenzhen), Shenzhen, China
