
Detecting outliers with one-class selective transfer machine

Abstract

In this paper, we propose a method for detecting outliers in an unlabeled target dataset by exploiting an unlabeled source dataset. Detecting outliers has attracted the attention of data miners for over two decades, since such outliers can be crucial in decision making, knowledge discovery, and fraud detection, to name but a few applications. The fact that outliers are scarce and often tedious to label has motivated researchers to propose detection methods for unlabeled datasets, some of which borrow strength from relevant labeled datasets in the framework of transfer learning. He et al. tackled a more challenging situation in which the input datasets, coming from multiple tasks, are all unlabeled. Their method, ML-OCSVM, conducts multi-task learning with one-class support vector machines (SVMs) and yields a mean model plus task-specific increments to detect outliers in the test datasets of the multiple tasks. We inherit a part of their problem setting, taking only unlabeled datasets as input, but increase the difficulty by assuming only one source dataset in addition to the target dataset. Consequently, the source dataset consists of examples relevant to the target task as well as examples that are less relevant. To cope with this situation, we extend the Selective Transfer Machine, which weights individual examples in the framework of covariate shift and learns an SVM classifier, to our one-class setting by replacing the binary SVMs with one-class SVMs. Experiments on two public datasets and an artificial dataset show that our method mostly outperforms baseline methods, including ML-OCSVM and a state-of-the-art ensemble anomaly detection method, in F1 score and AUC.
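The core idea of the abstract — reweighting source examples toward the target distribution, then fitting a one-class SVM with those weights — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it substitutes a simple KDE-based density ratio for kernel mean matching, and all names and parameter values here are our own assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Source data: a cluster relevant to the target plus a less relevant cluster.
X_src = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)),
                   rng.normal(5.0, 1.0, size=(20, 2))])
# Target data: centred at the origin, so only the first source cluster is relevant.
X_tar = rng.normal(0.0, 1.0, size=(50, 2))

# Covariate-shift weights p_tar(x) / p_src(x), estimated here by two KDEs
# (a crude stand-in for kernel mean matching).
w = gaussian_kde(X_tar.T)(X_src.T) / gaussian_kde(X_src.T)(X_src.T)
w /= w.mean()  # normalise so the average weight is 1

# Weighted one-class SVM: relevant source examples dominate the fit.
oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)
oc.fit(X_src, sample_weight=w)

pred = oc.predict(X_tar)  # +1 = inlier, -1 = outlier
```

With the irrelevant cluster down-weighted, the fitted boundary hugs the relevant region, so most target points are predicted as inliers even though the source set is contaminated.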


(Figures 1–18 appear in the full text.)

Notes

  1. Transfer learning obtains knowledge while solving one or several source tasks and applies it to solve the target task.

  2. We initially proposed OCSTM for detecting anomalous facial expressions [15]. This paper extends it to general outlier detection by adding experiments in ECG and synthetic domains.

  3. As we stated in Sect. 1, the test dataset in classification is used as an unlabeled target dataset in several methods.

  4. Chen and Liu tackled pain recognition as an application [37], and thus we use “pain” and “normal” as class labels for a clear explanation.

  5. Bold lower-case letters represent a column vector \(\mathbf{x}\); \(x_j\) denotes the value of the jth element of \(\mathbf{x}\). \(\mathbf{I}_n\in \mathbb {R}^{n\times n}\) is an identity matrix. \(\mathbf{0}_d\) represents a zero vector of length d, and \(\mathbf{1}_d\) represents an all-ones vector of length d.

  6. Since the source and target datasets can be regarded as the training and test datasets in classification, we use \(\mathrm{sc}\) and \(\mathrm{tar}\) in their symbols, respectively.

  7. Intuitively, when estimating the expected value of a random variable by sampling, importance sampling is a statistical technique that oversamples the important region, i.e., the region that contributes most to the expected value.
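A tiny numerical illustration of this idea (our own toy example, not from the paper): we want \(E[x]\) under \(p = N(3, 1)\) but can only sample from a proposal \(q = N(0, 2^2)\); reweighting each sample by \(p(x)/q(x)\) corrects for sampling from the wrong density.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: E[x] under p = N(3, 1), whose true value is 3.
# We sample only from the proposal q = N(0, 2^2), which still covers
# the "important region" around x = 3.
x = rng.normal(0.0, 2.0, size=200_000)

def log_pdf(v, mu, sigma):
    # Log-density of N(mu, sigma^2), computed in log space for stability.
    return -0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Importance weights p(x) / q(x).
w = np.exp(log_pdf(x, 3.0, 1.0) - log_pdf(x, 0.0, 2.0))
estimate = np.mean(w * x)  # close to 3 despite sampling from q
```

The weights are large exactly where p places more mass than q, which is what "oversampling the important region" amounts to in practice.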

  8. Since we assume unlabeled data, we dropped y from the argument of the weighting function.

  9. STM was originally proposed for a facial expression detection problem of a test subject based on training subjects. Since we handle TL, we call them the target task and the source tasks, respectively.

  10. The representer theorem proves that, for an optimization problem whose objective is a loss function plus a regularization term \(\lambda \Vert \mathbf{w}\Vert ^2\), the optimal solution can be written as a linear combination of kernel functions evaluated at the training (source) samples [43].
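In symbols, the statement reads as follows (the coefficients \(\beta_i\) are our notation, introduced only for this sketch):

```latex
\min_{\mathbf{w}}\;
\sum_{i=1}^{n} \ell\bigl(\mathbf{w}^{\top}\phi(\mathbf{x}_i^{\mathrm{sc}})\bigr)
+ \lambda \Vert \mathbf{w}\Vert^{2}
\quad\Longrightarrow\quad
\mathbf{w}^{*} = \sum_{i=1}^{n} \beta_i\, \phi(\mathbf{x}_i^{\mathrm{sc}}),
\qquad
f(\mathbf{x}) = \sum_{i=1}^{n} \beta_i\, k(\mathbf{x}_i^{\mathrm{sc}}, \mathbf{x}),
```

so the infinite-dimensional search over \(\mathbf{w}\) reduces to a finite search over the \(n\) coefficients \(\beta_i\), with \(k(\cdot,\cdot)\) the kernel induced by the feature map \(\phi\).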

  11. In this subsection, we follow the original notation and, unlike in the previous subsection, do not stack 1 in each data vector \(\mathbf{x}_i^{\mathrm{sc}}\).

  12. We give the details in the next paragraph.

  13. One of the T tasks is chosen as the target, and the remaining ones are agglomerated as the source task. We will explain the details in Sect. 5.3.

  14. Facial Action Units (AUs) represent changes in facial expression in terms of visually observable movements of the facial muscles [53, 54].

  15. https://www.physionet.org/, https://github.com/MIT-LCP/wfdb-python.

  16. In this case, Precision and Recall are both zero.

  17. https://github.com/BorgwardtLab/sampling-outlier-detection.

  18. This parameter \(\mu \) is different from \({\varvec{\mu }}_{\mathrm{n},t}\) in Eq. (34); however, we follow the symbol notation of [11] in this explanation.

  19. These parameters \(\xi \) and \(\tau \) are different from \(\xi \) in Sect. 3.2 and Sect. 4 and \(\tau \) in Sect. 3.1.3, respectively; however, we follow the symbol notation of [22].

  20. To optimize Eqs. (30) and (33) in the experiments, we use the interior-point method [60] provided in the “CVXOPT” Python library (http://cvxopt.org/index.html).
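The problems handed to the interior-point solver are convex quadratic programs. As a hedged illustration of that generic form (not Eqs. (30) or (33) themselves, and using SciPy's SLSQP solver rather than CVXOPT), here is a tiny QP with the simplex-style constraints that appear in SVM duals:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny convex QP of the generic form handled by interior-point methods:
#   minimize    (1/2) x^T P x + q^T x
#   subject to  x >= 0,  sum(x) = 1
# (constraints of the kind that arise in SVM-style dual problems).
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([-1.0, -1.0])

res = minimize(
    lambda x: 0.5 * x @ P @ x + q @ x,       # quadratic objective
    x0=np.array([0.5, 0.5]),                 # feasible starting point
    jac=lambda x: P @ x + q,                 # exact gradient
    bounds=[(0.0, None)] * 2,                # x >= 0
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
    method="SLSQP",
)
x_opt = res.x  # analytic optimum for this instance is (0.25, 0.75)
```

With CVXOPT, the same problem would be passed to `cvxopt.solvers.qp` in the matrix form above; SLSQP is used here only to keep the sketch dependency-light.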

  21. The predicted values of the decision function are negative.

  22. Recall again that we tuned the parameters of the other methods and not ours.

References

  1. Hawkins DM (1980) Identification of outliers. Chapman and Hall, London

  2. Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3–4):237–253

  3. Deguchi Y, Suzuki E (2015) Hidden fatigue detection for a desk worker using clustering of successive tasks. In: Ambient Intelligence, vol 9425 of LNCS. Springer, pp 268–283

  4. Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

  5. Tax DMJ, Duin RPW (2004) Support vector data description. Mach Learn 54(1):45–66

  6. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359

  7. Ganin Y, Lempitsky V (2015) Unsupervised domain adaptation by backpropagation. In: Proceedings of ICML, pp 1180–1189

  8. Sener O, Song HO, Saxena A, Savarese S (2016) Learning transferrable representations for unsupervised domain adaptation. In: Proceedings of NIPS, pp 2110–2118

  9. Long M, Zhu H, Wang J, Jordan MI (2016) Unsupervised domain adaptation with residual transfer networks. In: Proceedings of NIPS, pp 136–144

  10. Yang H, King I, Lyu MR (2010) Multi-task learning for one-class classification. In: Proceedings of IJCNN, pp 1–8

  11. He X, Mourot G, Maquin D, Ragot J, Beauseroy P, Smolarz A, Grall-Maës E (2014) Multi-task learning with one-class SVM. Neurocomputing 133:416–426

  12. Chu W-S, Torre FDL, Cohn JF (2017) Selective transfer machine for personalized facial expression analysis. IEEE Trans Pattern Anal Mach Intell 39(3):529–545

  13. Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift by kernel mean matching. In: Dataset shift in machine learning, chapter 8. The MIT Press, Cambridge, pp 131–160

  14. Sugiyama M, Krauledat M, Müller K-R (2007) Covariate shift adaptation by importance weighted cross validation. J Mach Learn Res 8:985–1005

  15. Fujita H, Matsukawa T, Suzuki E (2018) One-class selective transfer machine for personalized anomalous facial expression detection. In: Proceedings of VISIGRAPP, vol 5: VISAPP, pp 274–283

  16. Han J, Kamber M, Pei J (2012) Data mining, 3rd edn. Morgan Kaufmann, Waltham

  17. Schapire RE (1999) A brief introduction to boosting. In: Proceedings of IJCAI, pp 1401–1406

  18. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226–231

  19. Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of SIGMOD, pp 49–60

  20. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of KDD, pp 58–65

  21. Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of VLDB, pp 186–195

  22. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of SIGMOD, pp 94–105

  23. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. SIGMOD Rec 29(2):93–104

  24. Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: Proceedings of NIPS, pp 467–475

  25. Bellman RE (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

  26. Vapnik V (1995) The nature of statistical learning theory. Springer, New York

  27. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167

  28. Liu H, Liu T, Wu J, Tao D, Fu Y (2015) Spectral ensemble clustering. In: Proceedings of KDD, pp 715–724

  29. Zhao Y, Nasrullah Z, Hryniewicki MK, Li Z (2019) LSCP: locally selective combination in parallel outlier ensembles. In: Proceedings of SDM

  30. Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99

  31. Yao Y, Doretto G (2010) Boosting for transfer learning with multiple sources. In: Proceedings of CVPR, pp 1855–1862

  32. Ge L, Gao J, Ngo H, Li K, Zhang A (2014) On handling negative transfer and imbalanced distributions in multiple source transfer learning. Stat Anal Data Min ASA Data Sci J 7(4):254–271

  33. Cao B, Pan SJ, Zhang Y, Yeung D-Y, Yang Q (2010) Adaptive transfer learning. In: Proceedings of AAAI, pp 407–712

  34. Tzeng E, Hoffman J, Darrell T, Saenko K (2015) Simultaneous deep transfer across domains and tasks. In: Proceedings of ICCV, pp 4068–4076

  35. Chen J, Liu X, Tu P, Aragones A (2013) Learning person-specific models for facial expression and action unit recognition. Pattern Recognit Lett 34(15):1964–1970

  36. Kodirov E, Xiang T, Fu Z-Y, Gong S (2015) Unsupervised domain adaptation for zero-shot learning. In: Proceedings of ICCV, pp 2452–2460

  37. Chen J, Liu X (2014) Transfer learning with one-class data. Pattern Recognit Lett 37:32–40

  38. Sangineto E, Zen G, Ricci E, Sebe N (2014) We are not all equal: personalizing models for facial expression analysis with transductive parameter transfer. In: Proceedings of ACM international conference on multimedia, pp 357–366

  39. Zen G, Porzi L, Sangineto E, Ricci E, Sebe N (2016) Learning personalized models for facial expression analysis and gesture recognition. IEEE Trans Multimed 18(4):775–788

  40. Sugiyama M, Nakajima S, Kashima H, Buenau PV, Kawanabe M (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of NIPS, pp 1433–1440

  41. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  42. Candela JQ, Sugiyama M, Schwaighofer A, Lawrence ND (2009) Dataset shift in machine learning. MIT Press, Cambridge

  43. Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178

  44. Amari S, Wu S (1999) Improving support vector machine classifiers by modifying kernel functions. Neural Netw 12(6):783–789

  45. Gorski J, Pfeuffer F, Klamroth K (2007) Biconvex sets and optimization with biconvex functions: a survey and extensions. Math Methods Oper Res 66(3):373–407

  46. Monteiro RDC, Adler I (1989) Interior path following primal–dual algorithms. Part II: convex quadratic programming. Math Program 44(1–3):43–66

  47. Lucey P, Cohn JF, Prkachin KM, Solomon PE, Matthews I (2011) Painful data: the UNBC-McMaster shoulder pain expression archive database. In: Proceedings of IEEE international conference on automatic face and gesture recognition and workshops, pp 57–64

  48. Prkachin KM, Solomon PE (2008) The structure, reliability and validity of pain expression: evidence from patients with shoulder pain. Pain 139(2):267–274

  49. Moody GB, Mark RG (2001) The impact of the MIT-BIH arrhythmia database. IEEE Eng Med Biol Mag 20(3):45–50

  50. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of ICCV, pp 1150–1157

  51. Ahonen T, Hadid A, Pietikäinen M (2006) Face description with local binary patterns: application to face recognition. IEEE Trans Pattern Anal Mach Intell 28(12):2037–2041

  52. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685

  53. Ekman P, Friesen WV (1975) Unmasking the face: a guide to recognizing emotions from facial cues. Prentice Hall, Englewood Cliffs

  54. Mohammadian A, Aghaeinia H, Towhidkhah F et al (2016) Subject adaptation using selective style transfer mapping for detection of facial action units. Expert Syst Appl 56:282–290

  55. Pan J, Tompkins WJ (1985) A real-time QRS detection algorithm. IEEE Trans Biomed Eng BME-32(3):230–236

  56. Yu S-N, Chen Y-H (2007) Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network. Pattern Recognit Lett 28(10):1142–1150

  57. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  58. Andrei N (2019) PyClustering: data mining library. J Open Source Softw 4(36):1230

  59. Zhao Y, Nasrullah Z, Li Z (2019) PyOD: a python toolbox for scalable outlier detection. arXiv preprint arXiv:1901.01588

  60. Andersen M, Dahl J, Liu Z, Vandenberghe L (2011) Interior-point methods for large-scale cone programming. In: Optimization for machine learning. MIT Press, Cambridge, pp 55–83


Acknowledgements

A part of this research was supported by Grants-in-Aid for Scientific Research JP15K12100 and JP18H03290 from the Japan Society for the Promotion of Science (JSPS).

Author information


Corresponding author

Correspondence to Einoshin Suzuki.




Cite this article

Fujita, H., Matsukawa, T. & Suzuki, E. Detecting outliers with one-class selective transfer machine. Knowl Inf Syst 62, 1781–1818 (2020). https://doi.org/10.1007/s10115-019-01407-5


Keywords

  • One-class outlier detection
  • One-class support vector machines
  • Kernel mean matching
  • Transfer learning