Knowledge and Information Systems, Volume 26, Issue 2, pp 309–336

Statistical outlier detection using direct density ratio estimation

  • Shohei Hido
  • Yuta Tsuboi
  • Hisashi Kashima
  • Masashi Sugiyama
  • Takafumi Kanamori
Regular Paper


We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
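The closed-form estimate mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel model whose centers are drawn from the training (inlier) samples, and it fixes the kernel width `sigma` and the regularization parameter `lam` by hand rather than choosing them by cross-validation as the abstract describes. The function names, default values, and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel values between each row of x and each center."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_scores(x_train, x_test, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Estimate w(x) = p_train(x) / p_test(x) at the test points.

    Hypothetical helper: sigma and lam are fixed here (an assumption);
    the paper tunes them by cross-validation, which this sketch omits.
    """
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_train))
    idx = rng.choice(len(x_train), size=n_centers, replace=False)
    centers = x_train[idx]

    phi_train = gaussian_kernel(x_train, centers, sigma)  # numerator samples
    phi_test = gaussian_kernel(x_test, centers, sigma)    # denominator samples

    # Least-squares fit of the ratio: second moment over test samples,
    # first moment over training samples.
    H = phi_test.T @ phi_test / len(x_test)
    h = phi_train.mean(axis=0)

    # Closed-form regularized solution, then clip negative coefficients
    # so that the estimated ratio is non-negative.
    alpha = np.linalg.solve(H + lam * np.eye(n_centers), h)
    alpha = np.maximum(alpha, 0.0)
    return phi_test @ alpha

# Toy data (assumed for illustration): inliers near the origin,
# five test outliers placed far away at around (6, 6).
rng = np.random.default_rng(42)
x_train = rng.normal(size=(200, 2))
x_test = np.vstack([
    rng.normal(size=(95, 2)),
    rng.normal(loc=6.0, scale=0.1, size=(5, 2)),
])
scores = ulsif_scores(x_train, x_test)
```

In this sketch a small score marks a test point as a likely outlier: the training density is close to zero where only test samples appear, so the estimated ratio for the five distant points falls well below the typical inlier value.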


Outlier detection; Density ratio; Importance; Unconstrained least-squares importance fitting (uLSIF)





Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Shohei Hido (1, 2), corresponding author
  • Yuta Tsuboi (1)
  • Hisashi Kashima (1, 6)
  • Masashi Sugiyama (3, 4)
  • Takafumi Kanamori (5)
  1. IBM Research - Tokyo, Kanagawa, Japan
  2. Department of Systems Science, Graduate School of Informatics, Kyoto University, Kyoto, Japan
  3. Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan
  4. PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
  5. Department of Computer Science and Mathematical Informatics, Graduate School of Information Science, Nagoya University, Nagoya, Japan
  6. Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
