Significant Pattern Mining with Confounding Variables

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9651)

Abstract

Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications because they cannot handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, limiting the amount of correction required for multiple testing. Using synthetic data, we showed that our method could remove the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrated an application of data integration using a confounding variable.
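
To illustrate how a testability bound limits the multiple testing correction, the following is a minimal Python sketch of Tarone's procedure with a one-sided Fisher's exact test, the general idea behind LAMP-style methods. It is not the exact logistic regression bound proposed in this paper, and it ignores confounding variables entirely. The function names, the toy supports, and the class sizes are assumptions made for illustration only.

```python
from math import comb

def min_p_value(support, n_pos, n_total):
    """Smallest one-sided Fisher p-value a pattern with this support can attain.

    The most extreme table places as many of the pattern's occurrences as
    possible in the positive class; its hypergeometric probability is the bound.
    """
    a = min(support, n_pos)
    return comb(n_pos, a) * comb(n_total - n_pos, support - a) / comb(n_total, support)

def tarone_correction(supports, n_pos, n_total, alpha=0.05):
    """Return (k, testable): the correction factor k and the testable pattern indices.

    k is the smallest integer such that at most k patterns have a minimum
    attainable p-value <= alpha / k. Patterns outside `testable` can never be
    significant at that level, so they need not be tested or counted.
    """
    min_ps = [min_p_value(s, n_pos, n_total) for s in supports]
    for k in range(1, len(min_ps) + 1):
        testable = [i for i, p in enumerate(min_ps) if p <= alpha / k]
        if len(testable) <= k:
            return k, testable
    return 1, []  # only reached when no patterns were given

# Hypothetical toy data: 20 samples, 10 positive, five patterns with these supports.
k, testable = tarone_correction([1, 2, 5, 8, 10], n_pos=10, n_total=20)
print(k, testable)
```

In this toy setting the two low-support patterns cannot reach significance, so the correction factor is 3 instead of the Bonferroni factor of 5.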

Keywords

Significant pattern mining; Multiple testing; Exact logistic regression

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
  2. Research Fellow of Japan Society for the Promotion of Science, Kojimachi, Japan
  3. Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  4. Center for Materials Research by Information Integration, National Institute for Materials Science, Ibaraki, Japan
