Knowledge and Information Systems, Volume 34, Issue 3, pp 597–618

SVDD-based outlier detection on uncertain data

  • Bo Liu
  • Yanshan Xiao
  • Longbing Cao
  • Zhifeng Hao (email author)
  • Feiqi Deng
Regular Paper

Abstract

Outlier detection is an important problem that has been studied in diverse research areas and application domains. Most existing methods assume that an example can be categorized exactly as either normal or an outlier. However, in many real-life applications, data are inherently uncertain because of various errors or partial completeness. This uncertainty makes outliers far more difficult to detect than in clearly separable data. The key challenge in handling uncertain data for outlier detection is how to reduce the impact of uncertain data on the learned distinctive classifier. This paper proposes a new SVDD-based approach to detecting outliers on uncertain data. The approach operates in two steps. In the first step, a pseudo-training set is generated by assigning each input example a confidence score, which indicates the likelihood that the example belongs to the normal class. In the second step, the confidence scores are incorporated into the support vector data description (SVDD) training phase to construct a global distinctive classifier for outlier detection. In this phase, the contribution of the examples with the lowest confidence scores to the construction of the decision boundary is reduced. Experiments show that the proposed approach outperforms state-of-the-art outlier detection techniques.
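
The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the confidence scores are computed or how the training problem is solved, so everything specific here is an assumption. `confidence_scores` is a hypothetical heuristic (closeness to the data centroid), and `fit_weighted_svdd` solves a commonly used weighted SVDD dual with a Gaussian kernel, in which each dual variable is capped at `C * conf[i]`, using Frank-Wolfe iterations. All function and parameter names are illustrative, and the radius threshold is omitted, so the result is an outlier score rather than a hard decision.

```python
import math

def rbf(x, y, gamma):
    """Gaussian kernel; k(x, x) = 1 for every x."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def confidence_scores(X, floor=0.05):
    """Step 1 (assumed heuristic): closeness to the centroid, scaled into [floor, 1]."""
    dim = len(X[0])
    centroid = [sum(x[j] for x in X) / len(X) for j in range(dim)]
    dist = [math.sqrt(sum((a - c) ** 2 for a, c in zip(x, centroid))) for x in X]
    d_max = max(dist) or 1.0
    return [max(floor, 1.0 - d / d_max) for d in dist]

def _fill(order, upper):
    """Place a total mass of 1 on the indices in the given order, respecting the caps."""
    s, remaining = [0.0] * len(upper), 1.0
    for i in order:
        s[i] = min(upper[i], remaining)
        remaining -= s[i]
        if remaining <= 0.0:
            break
    return s

def fit_weighted_svdd(X, conf, C=0.5, gamma=0.5, iters=500):
    """Step 2 (sketch): minimise alpha^T K alpha subject to
    sum(alpha) = 1 and 0 <= alpha[i] <= C * conf[i]."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    upper = [C * m for m in conf]  # low confidence -> small cap -> little influence
    if sum(upper) < 1.0:
        raise ValueError("C is too small: the caps C * conf[i] must sum to at least 1")
    alpha = _fill(range(n), upper)  # any feasible starting point
    for t in range(iters):
        grad = [2.0 * sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        # Frank-Wolfe direction: load the smallest-gradient coordinates first.
        vertex = _fill(sorted(range(n), key=grad.__getitem__), upper)
        step = 2.0 / (t + 2.0)
        alpha = [(1.0 - step) * a + step * v for a, v in zip(alpha, vertex)]
    const = sum(alpha[i] * alpha[j] * K[i][j] for i in range(n) for j in range(n))

    def score(z):
        """Squared feature-space distance from z to the sphere centre."""
        return 1.0 - 2.0 * sum(a * rbf(x, z, gamma) for a, x in zip(alpha, X)) + const

    return alpha, score

# Toy usage: seven points near the origin and one distant, low-confidence point.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
     (0.0, -0.1), (0.1, 0.1), (-0.1, -0.1), (3.0, 3.0)]
conf = confidence_scores(X)
alpha, score = fit_weighted_svdd(X, conf)
```

In this sketch a larger `score(z)` means that `z` lies farther from the learned centre. Because the cap on each dual variable shrinks with its confidence score, a suspicious training example cannot pull the centre toward itself as strongly as it would in an unweighted SVDD.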

Keywords

Outlier detection · Data of uncertainty · Support vector data description

Copyright information

© Springer-Verlag London Limited 2012

Authors and Affiliations

  • Bo Liu (1)
  • Yanshan Xiao (2)
  • Longbing Cao (3)
  • Zhifeng Hao (2, email author)
  • Feiqi Deng (4)

  1. Faculty of Automation, Guangdong University of Technology, Guangdong, People’s Republic of China
  2. Faculty of Computer, Guangdong University of Technology, Guangdong, People’s Republic of China
  3. Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
  4. School of Automation Science and Engineering, South China University of Technology, Guangdong, People’s Republic of China
