Significant Pattern Mining with Confounding Variables
Recent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are adjusted to control the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications because they cannot handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can exclude a large number of combinations without testing them, which limits the amount of correction required for multiple testing. Using synthetic data, we show that our method removes the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we demonstrate an application of the method to data integration using a confounding variable.
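The idea of testability can be illustrated with a minimal sketch. It assumes the simpler, confounder-free setting of LAMP-style methods, where each pattern is scored with a one-sided Fisher's exact test and corrected with Tarone's procedure; it is not the exact logistic regression bound proposed here. The function names `min_p_value` and `tarone_correction_factor` are hypothetical and chosen for illustration only.

```python
from fractions import Fraction
from math import comb


def min_p_value(support, n, n_pos):
    """Smallest one-sided Fisher p-value a pattern can attain.

    `support` is the number of samples containing the pattern, `n` the
    total number of samples, and `n_pos` the number of positive samples.
    The minimum is reached by the most extreme table, in which as many
    occurrences as possible fall in the positive class.
    """
    a = min(support, n_pos)
    return Fraction(comb(n_pos, a) * comb(n - n_pos, support - a),
                    comb(n, support))


def tarone_correction_factor(supports, n, n_pos, alpha):
    """Smallest k such that at most k patterns are testable at alpha / k.

    A pattern is testable at level alpha / k only if its minimum
    attainable p-value does not exceed alpha / k. Untestable patterns can
    never be significant, so they do not count towards the correction.
    """
    p_mins = [min_p_value(s, n, n_pos) for s in supports]
    k = 1
    while sum(p <= Fraction(alpha) / k for p in p_mins) > k:
        k += 1
    return k


# Six candidate patterns in 20 samples, 10 of which are positive.
supports = [1, 2, 3, 5, 8, 10]
k = tarone_correction_factor(supports, n=20, n_pos=10, alpha=Fraction(1, 20))
print(k, len(supports))  # Tarone factor versus the Bonferroni factor
```

In this toy setting the three rarest patterns can never reach significance, so the correction factor is 3 rather than the Bonferroni factor of 6. Rare patterns dominate the search space in practice, which is why excluding them without testing recovers statistical power.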
Keywords: Significant pattern mining, Multiple testing, Exact logistic regression
A.T. is supported by JST PRESTO and JSPS Research Fellowships for Young Scientists. The research of K.T. was supported by JST CREST, JST ERATO, RIKEN PostK, NIMS MI2I, Kakenhi Nanostructure and Kakenhi 15H05711.