Data Mining and Knowledge Discovery, Volume 8, Issue 3, pp. 275–300

On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms

  • Kenji Yamanishi
  • Jun-ichi Takeuchi
  • Graham Williams
  • Peter Milne

Abstract

Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine that addresses this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through on-line unsupervised learning of a probabilistic model (a finite mixture model) of the information source. Each time a datum is input, SmartSifter employs an on-line discounting learning algorithm to update the probabilistic model. A score is assigned to the datum based on the learned model, with a high score indicating a high possibility that the datum is a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. In an experimental application to network intrusion detection, SmartSifter identified data with high scores that corresponded to attacks, at low computational cost. A further experimental application identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.

Keywords: outlier detection, intrusion detection, anomaly detection, fraud detection, finite mixture model, EM algorithm
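
As a rough illustration of the on-line discounting scheme described in the abstract, the sketch below learns a Gaussian mixture from a data stream with exponential forgetting and scores each new datum by its logarithmic loss (negative log-likelihood) under the model learned so far, so that higher scores flag likely outliers. The class name, the discounting factor r, the smoothing constant alpha, and the specific update rules are illustrative assumptions covering continuous variables only; they are not the authors' exact SmartSifter algorithms.

import numpy as np


class DiscountingGaussianMixture:
    """On-line Gaussian mixture with exponential discounting of past data (illustrative sketch)."""

    def __init__(self, n_components=2, dim=2, r=0.02, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.k = n_components
        self.r = r          # discounting factor: weight given to the newest datum
        self.alpha = alpha  # smoothing constant keeping mixture weights away from zero
        self.pi = np.full(self.k, 1.0 / self.k)       # mixture weights
        self.mu = rng.normal(size=(self.k, dim))      # component means
        self.cov = np.stack([np.eye(dim)] * self.k)   # component covariances

    def _component_densities(self, x):
        # Gaussian density of x under each mixture component.
        dens = np.empty(self.k)
        for i in range(self.k):
            d = x - self.mu[i]
            inv = np.linalg.inv(self.cov[i])
            norm = np.sqrt((2 * np.pi) ** x.size * np.linalg.det(self.cov[i]))
            dens[i] = np.exp(-0.5 * d @ inv @ d) / norm
        return dens

    def score_and_update(self, x):
        x = np.asarray(x, dtype=float)
        dens = self._component_densities(x)
        mix = float(self.pi @ dens)
        score = -np.log(mix + 1e-300)   # logarithmic loss: the higher, the more outlying

        # Smoothed posterior responsibilities of the components for the new datum.
        gamma = (1 - self.alpha * self.r) * self.pi * dens / (mix + 1e-300) \
            + self.alpha * self.r / self.k

        # Discounted (exponentially forgetting) parameter updates.
        for i in range(self.k):
            self.pi[i] = (1 - self.r) * self.pi[i] + self.r * gamma[i]
            step = self.r * gamma[i] / self.pi[i]     # lies in (0, 1]
            d = x - self.mu[i]
            self.mu[i] = self.mu[i] + step * d
            self.cov[i] = (1 - self.r) * self.cov[i] + step * np.outer(d, d)
        return score


# Toy stream: inliers around the origin, followed by one far-away point.
rng = np.random.default_rng(1)
model = DiscountingGaussianMixture(n_components=2, dim=2, r=0.02)
stream = list(rng.normal(size=(500, 2))) + [np.array([8.0, 8.0])]
scores = [model.score_and_update(x) for x in stream]
print("median inlier score:", np.median(scores[:-1]), "outlier score:", scores[-1])

Because the updates forget old data at rate r, the model tracks non-stationary sources, which is the property the abstract highlights; the log-loss score used here is one simple choice of statistically meaningful score.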



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Kenji Yamanishi (1)
  • Jun-ichi Takeuchi (1)
  • Graham Williams (2)
  • Peter Milne (2)
  1. Internet Systems Research Laboratories, NEC Corporation, Miyamae, Kawasaki, Kanagawa, Japan
  2. CSIRO Mathematical and Information Sciences, Canberra, Australia
