Significant Pattern Mining with Confounding Variables

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9651)

Abstract

Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications because they cannot handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, limiting the amount of correction required for multiple testing. Using synthetic data, we showed that our method could remove the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrated an application to data integration using a confounding variable.
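To illustrate how a testability bound limits the multiple-testing correction, here is a minimal Python sketch. It is not the bound proposed in this paper, which rests on the exact logistic regression model and is not given in the abstract. Instead it assumes the simpler setting without confounders, where each pattern is tested with a one-sided Fisher's exact test and a Tarone-style correction factor is used. The function names `min_pvalue` and `tarone_factor` and the toy supports are hypothetical.

```python
from math import comb


def min_pvalue(support, n_pos, n_total):
    """Smallest one-sided Fisher p-value reachable by a pattern that occurs
    in `support` of `n_total` samples, of which `n_pos` are positive."""
    if support <= n_pos:
        # Most extreme table: every occurrence falls in the positive class.
        return comb(n_pos, support) / comb(n_total, support)
    # More occurrences than positives: every positive contains the pattern.
    return comb(n_total - n_pos, support - n_pos) / comb(n_total, support)


def tarone_factor(supports, n_pos, n_total, alpha=0.05):
    """Smallest k such that at most k patterns are testable at level alpha / k.

    A pattern is testable at a given level if its minimum attainable p-value
    does not exceed that level. Untestable patterns can never be significant,
    so they need not be counted in the correction.
    """
    min_ps = [min_pvalue(s, n_pos, n_total) for s in supports]
    for k in range(1, len(min_ps) + 1):
        testable = sum(p <= alpha / k for p in min_ps)
        if testable <= k:
            return k
    return max(1, len(min_ps))


# Toy example (assumed data): 20 samples, 10 positive, five candidate patterns.
supports = [1, 2, 3, 8, 10]
k = tarone_factor(supports, n_pos=10, n_total=20)
print("Bonferroni factor:", len(supports))
print("Tarone-style factor:", k)
```

In this toy case the three low-support patterns cannot reach significance at any corrected level, so the corrected threshold is alpha / 2 rather than alpha / 5. The paper's contribution, per the abstract, is a bound that plays the same role when confounding variables are included in the model.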

Notes

  1. https://code.google.com/p/lcmplusplus/

Acknowledgments

A.T. is supported by JST PRESTO and JSPS Research Fellowships for Young Scientists. The research of K.T. was supported by JST CREST, JST ERATO, RIKEN PostK, NIMS MI2I, Kakenhi Nanostructure and Kakenhi 15H05711.

Author information

Corresponding author

Correspondence to Koji Tsuda.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Terada, A., duVerle, D., Tsuda, K. (2016). Significant Pattern Mining with Confounding Variables. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9651. Springer, Cham. https://doi.org/10.1007/978-3-319-31753-3_23

  • DOI: https://doi.org/10.1007/978-3-319-31753-3_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31752-6

  • Online ISBN: 978-3-319-31753-3

  • eBook Packages: Computer Science, Computer Science (R0)
