Knowledge and Information Systems, Volume 47, Issue 1, pp 99–129

Efficient discovery of contrast subspaces for object explanation and characterization

  • Lei Duan (Email author)
  • Guanting Tang
  • Jian Pei
  • James Bailey
  • Guozhu Dong
  • Vinh Nguyen
  • Akiko Campbell
  • Changjie Tang
Regular Paper

Abstract

We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes \(C_+\) and \(C_-\) and a query object \(o\), we want to find the top-\(k\) subspaces that maximize the ratio of likelihood of \(o\) in \(C_+\) against that in \(C_-\). Such subspaces are very useful for characterizing an object and explaining how it differs between two classes. We demonstrate that this problem has important applications, and, at the same time, is very challenging, being MAX SNP-hard. We present CSMiner, a mining method that uses kernel density estimation in conjunction with various pruning techniques. We experimentally investigate the performance of CSMiner on a range of data sets, evaluating its efficiency, effectiveness, and stability and demonstrating it is substantially faster than a baseline method.
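
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not CSMiner and applies none of its pruning techniques: it enumerates every non-empty subspace, estimates the likelihood of \(o\) in each class with a Gaussian product kernel, and ranks subspaces by the likelihood ratio. The function names, the fixed `bandwidth`, the smoothing constant `eps`, and the toy data are all assumptions made for illustration and do not come from the paper.

```python
import itertools
import math


def kde_likelihood(points, query, dims, bandwidth):
    """Gaussian product-kernel density estimate of `query`, using only `dims`."""
    d = len(dims)
    # Normalising constant of a d-dimensional Gaussian kernel, times sample size.
    norm = len(points) * (math.sqrt(2.0 * math.pi) * bandwidth) ** d
    total = 0.0
    for p in points:
        sq_dist = sum((p[j] - query[j]) ** 2 for j in dims)
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm


def top_k_contrast_subspaces(pos, neg, query, k, bandwidth=1.0, eps=1e-12):
    """Return the k subspaces with the largest likelihood ratio L(o|C+) / L(o|C-).

    Exhaustive search over all 2^d - 1 non-empty subspaces (assumed baseline).
    `eps` guards against division by zero and is an illustrative choice.
    """
    n_dims = len(query)
    scored = []
    for size in range(1, n_dims + 1):
        for dims in itertools.combinations(range(n_dims), size):
            like_pos = kde_likelihood(pos, query, dims, bandwidth)
            like_neg = kde_likelihood(neg, query, dims, bandwidth)
            scored.append((like_pos / (like_neg + eps), dims))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy data (hypothetical): the query sits close to C+ in dimension 0 and
# far from C- in dimension 0, while dimension 1 does not separate the classes.
pos = [(0.1, 0.0), (-0.1, 5.0), (0.0, -5.0)]
neg = [(5.0, 0.1), (-5.0, -0.1), (6.0, 0.0)]
query = (0.0, 0.0)

top = top_k_contrast_subspaces(pos, neg, query, k=2)
for ratio, dims in top:
    print(dims, ratio)
```

Because the loop visits all \(2^d - 1\) subspaces of a \(d\)-dimensional data set, this exhaustive approach becomes impractical as dimensionality grows, which is why pruning matters for this problem.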

Keywords

Contrast subspace, Kernel density estimation, Likelihood contrast

Notes

Acknowledgments

The authors are grateful to the editor and the anonymous reviewers for their constructive comments, which helped to improve this paper. Lei Duan’s research was supported in part by the National Natural Science Foundation of China (Grant No. 61103042), the China Postdoctoral Science Foundation (Grant No. 2014M552371), and SRFDP 20100181120029. Jian Pei’s and Guanting Tang’s research was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey’s work was supported by an ARC Future Fellowship (FT110100112). Work by Lei Duan and Guozhu Dong at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  • Lei Duan (1, Email author)
  • Guanting Tang (2)
  • Jian Pei (2)
  • James Bailey (3)
  • Guozhu Dong (4)
  • Vinh Nguyen (3)
  • Akiko Campbell (5)
  • Changjie Tang (1)

  1. School of Computer Science, Sichuan University, Chengdu, China
  2. School of Computing Science, Simon Fraser University, Burnaby, Canada
  3. Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  4. Department of Computer Science and Engineering, Wright State University, Dayton, USA
  5. Pacific Blue Cross, Burnaby, Canada
