A simple and effective outlier detection algorithm for categorical data

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Outlier detection is an important data mining task that has attracted substantial attention across diverse research communities and application areas. Many techniques have been developed to detect outliers, but most existing research focuses on numerical data. These techniques cannot be applied directly to categorical data because it is difficult to define a meaningful similarity measure for such data. In this paper, we first give a weighted density definition that simultaneously takes into account the density and the uncertainty of objects on every attribute. Based on this weighted density, we then propose a simple and effective outlier detection algorithm for categorical data and analyze its time complexity. Experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed algorithm.
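
The abstract does not state the weighted density formula, so the following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: the density of an object on an attribute is assumed to be the fraction of objects sharing its value, the uncertainty of an attribute is assumed to be its Shannon entropy, and attribute weights are assumed to be the normalized entropies. The function names (attribute_entropy, weighted_density, rank_outliers) are hypothetical. Objects with the lowest weighted density are treated as outlier candidates.

```python
from collections import Counter
from math import log2


def attribute_entropy(column):
    """Shannon entropy of one categorical attribute (assumed uncertainty measure)."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def weighted_density(rows):
    """Weighted density of each object; rows is a list of equal-length tuples.

    Assumption: weight of an attribute = its entropy / sum of all entropies.
    This is an illustrative choice, not necessarily the paper's definition.
    """
    if not rows:
        return []
    n = len(rows)
    columns = list(zip(*rows))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        weights = [1 / len(columns)] * len(columns)
    else:
        weights = [e / total for e in entropies]
    value_counts = [Counter(col) for col in columns]
    return [
        sum(w * counts[v] / n for w, counts, v in zip(weights, value_counts, row))
        for row in rows
    ]


def rank_outliers(rows, k):
    """Indices of the k objects with the lowest weighted density."""
    scores = weighted_density(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


rows = [("a", "x"), ("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
top2 = rank_outliers(rows, 2)
```

In this toy data set the last object carries a rare value on both attributes, so it receives the lowest score. Each attribute is scanned a constant number of times, so the sketch runs in time linear in the number of objects times the number of attributes.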

Acknowledgements

The authors are very grateful to the anonymous reviewers and editor. Their many helpful and constructive comments and suggestions helped us significantly improve this work. This work was supported by the National Natural Science Foundation of China (No. 71031006), the Foundation of Doctoral Program Research of Ministry of Education of China (No. 20101401110002), the Construction Project of the Science and Technology Basic Condition Platform of Shanxi Province (No. 2012091002-0101) and Shanxi Scholarship Council of China (No. 2013-101).

Author information

Corresponding author

Correspondence to Xingwang Zhao.

About this article

Cite this article

Zhao, X., Liang, J. & Cao, F. A simple and effective outlier detection algorithm for categorical data. Int. J. Mach. Learn. & Cyber. 5, 469–477 (2014). https://doi.org/10.1007/s13042-013-0202-4

  • DOI: https://doi.org/10.1007/s13042-013-0202-4
