Outlier Detection with Arbitrary Probability Functions

  • Fabrizio Angiulli
  • Fabio Fassetti
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8249)

Abstract

We consider the problem of unsupervised outlier detection in large collections of data objects when the objects are modeled by arbitrary multidimensional probability density functions (pdfs). Specifically, we present a novel definition of outlier for uncertain data under the attribute-level uncertainty model, according to which an uncertain object always exists but its actual value is modeled by a multivariate pdf. The notion of outlier is distance-based: an uncertain object is declared an outlier on the basis of the expected number of its neighbors in the data set. To the best of our knowledge, this is the first work that considers unsupervised outlier detection in the full feature space for data objects modeled by arbitrarily shaped multidimensional distribution functions. We present properties that reduce the number of probability distance computations required, together with an efficient algorithm for determining the outliers in an input uncertain data set.
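
As a rough illustration of the distance-based notion described in the abstract, the following Python sketch estimates the expected number of neighbors of each uncertain object by plain Monte Carlo sampling. It is not the algorithm of the paper: the sampler-based representation of an object, the helper names (make_gaussian, neighbor_probability, expected_neighbors, find_outliers), the radius and k parameters, and the rule "outlier if the expected neighbor count is below k" are all assumptions made for this example, and the sketch uses none of the pruning properties the abstract mentions.

```python
import math
import random

def make_gaussian(center, sigma, rng):
    """Hypothetical uncertain object: a sampler for an isotropic Gaussian pdf."""
    def sample():
        return [rng.gauss(c, sigma) for c in center]
    return sample

def neighbor_probability(sample_a, sample_b, radius, trials):
    """Monte Carlo estimate of Pr(dist(A, B) <= radius) for two uncertain objects."""
    hits = 0
    for _ in range(trials):
        if math.dist(sample_a(), sample_b()) <= radius:
            hits += 1
    return hits / trials

def expected_neighbors(objects, i, radius, trials=2000):
    """Expected number of objects other than i lying within radius of object i.

    By linearity of expectation this is the sum of the pairwise
    neighbor probabilities.
    """
    return sum(
        neighbor_probability(objects[i], objects[j], radius, trials)
        for j in range(len(objects))
        if j != i
    )

def find_outliers(objects, radius, k, trials=2000):
    """Assumed decision rule: flag objects whose expected neighbor count is below k."""
    return [
        i for i in range(len(objects))
        if expected_neighbors(objects, i, radius, trials) < k
    ]

# Four objects clustered near the origin and one far away (illustrative data only).
rng = random.Random(0)
centers = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2), (8.0, 8.0)]
objects = [make_gaussian(c, 0.3, rng) for c in centers]
outliers = find_outliers(objects, radius=2.0, k=2)
print(outliers)
```

Because each object is only a sampler, any pdf that can be sampled fits this representation. The cost is quadratic in the number of objects times the number of trials, which is the kind of cost the properties mentioned in the abstract are meant to reduce.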

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Fabrizio Angiulli (1)
  • Fabio Fassetti (1)
  1. DIMES Dept., University of Calabria, Rende, Italy
