Outlier Detection Forest for Large-Scale Categorical Data Sets

  • Conference paper
Computational Data and Social Networks (CSoNet 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11917)

Abstract

Outlier detection is one of the most important data mining problems and has attracted much attention in recent years. A variety of outlier detection schemes have been proposed, but most of them work with numeric data sets. They cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes tailored for categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF achieves the same level of outlier detection precision with much better scalability.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 61772154).

Author information

Corresponding author

Correspondence to Hongwei Du.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, Z. et al. (2019). Outlier Detection Forest for Large-Scale Categorical Data Sets. In: Tagarelli, A., Tong, H. (eds) Computational Data and Social Networks. CSoNet 2019. Lecture Notes in Computer Science, vol 11917. Springer, Cham. https://doi.org/10.1007/978-3-030-34980-6_4

  • DOI: https://doi.org/10.1007/978-3-030-34980-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34979-0

  • Online ISBN: 978-3-030-34980-6

  • eBook Packages: Computer Science (R0)
