Significant Pattern Mining with Confounding Variables

  • Aika Terada
  • David duVerle
  • Koji Tsuda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9651)

Abstract

Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications because they cannot handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, limiting the amount of correction required for multiple testing. Using synthetic data, we show that our method removes the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrate an application of the method to data integration using a confounding variable.
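
To illustrate the idea of a testability bound, the following is a minimal sketch of a Tarone-style correction of the kind used by LAMP-like methods. It is not the bound proposed in this paper: as a simplifying assumption it uses the minimum attainable p-value of a one-sided Fisher's exact test, with no confounding variable, instead of the exact logistic regression model. The function names `min_pvalue` and `tarone_factor` and the example supports are hypothetical and chosen only for illustration.

```python
from math import comb


def min_pvalue(support, n_pos, n_total):
    """Smallest one-sided Fisher p-value a pattern with this support can attain.

    The most extreme contingency table places as many of the pattern's
    occurrences as possible in the positive class.
    """
    if support == 0:
        return 1.0
    if support <= n_pos:
        # All occurrences fall in the positive class.
        return comb(n_pos, support) / comb(n_total, support)
    # More occurrences than positives: every positive sample is covered.
    n_neg = n_total - n_pos
    return comb(n_neg, support - n_pos) / comb(n_total, support)


def tarone_factor(supports, n_pos, n_total, alpha=0.05):
    """Return (k, alpha / k) for the smallest k such that at most k patterns
    are testable, i.e. have a minimum attainable p-value <= alpha / k."""
    min_ps = [min_pvalue(s, n_pos, n_total) for s in supports]
    k = 1
    while True:
        testable = sum(p <= alpha / k for p in min_ps)
        if testable <= k:
            return k, alpha / k
        k += 1


if __name__ == "__main__":
    # Hypothetical toy data: 20 samples, 10 positive, five candidate patterns.
    supports = [1, 2, 5, 8, 10]
    k, threshold = tarone_factor(supports, n_pos=10, n_total=20)
    print(f"correction factor = {k}, corrected threshold = {threshold:.5f}")
```

In this toy setting the two patterns with supports 1 and 2 can never reach significance, so they are excluded without being tested and the correction factor is 3 rather than the Bonferroni factor of 5. The paper's contribution is an analogous bound that remains valid when a confounding variable is included in the model.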

Keywords

Significant pattern mining, Multiple testing, Exact logistic regression

Notes

Acknowledgments

A.T. is supported by JST PRESTO and JSPS Research Fellowships for Young Scientists. The research of K.T. was supported by JST CREST, JST ERATO, RIKEN PostK, NIMS MI2I, Kakenhi Nanostructure and Kakenhi 15H05711.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
  2. Research Fellow of Japan Society for the Promotion of Science, Kojimachi, Japan
  3. Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  4. Center for Materials Research by Information Integration, National Institute for Materials Science, Ibaraki, Japan