Abstract
In many scientific communities that use experiment databases, a crucial problem is how to assess the statistical significance (p-value) of a discovered hypothesis. Assessing combinatorial hypotheses is especially hard because it requires a multiple-testing procedure with a very large p-value correction factor. Recently, Terada et al. proposed a novel p-value correction method, called “Limitless Arity Multiple-testing Procedure” (LAMP), which uses frequent itemset enumeration to exclude infrequent itemsets that can never be significant. LAMP gives a much more accurate p-value correction than previous methods, which strengthens scientific discovery. However, the original LAMP implementation is sometimes too time-consuming for practical databases. We propose a new LAMP algorithm that essentially executes the itemset mining algorithm only once, whereas the previous one executes it many times. Our experimental results show that the proposed method is 10 to 100 times faster than the original LAMP. This algorithm makes it possible to discover patterns with significant p-values in a short time, even for very large-scale databases.
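The idea of excluding itemsets that can never be significant can be illustrated with a small sketch. It is not the authors' algorithm or the LAMP implementation: it assumes a one-sided Fisher exact test (the abstract does not name the test), enumerates itemsets by brute force instead of using a frequent itemset miner, and applies a simplified Tarone-style rule for choosing the support threshold. The function names, the toy database `db`, and the parameter `n_pos` (number of positive transactions) are all hypothetical.

```python
from itertools import combinations
from math import comb

def min_p_value(support, n_pos, n_total):
    # Lower bound on the one-sided Fisher exact p-value (assumed test) of a
    # pattern occurring in `support` transactions: every occurrence is positive.
    x = min(support, n_pos)  # clipping keeps the bound non-increasing in support
    return comb(n_pos, x) / comb(n_total, x)

def support_counts(db):
    # Brute-force support of every non-empty itemset; toy sizes only.
    items = sorted(set().union(*db))
    counts = {}
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            s = sum(1 for t in db if set(itemset) <= t)
            if s > 0:
                counts[itemset] = s
    return counts

def correction_factor(db, n_pos, alpha=0.05):
    # Largest sigma such that itemsets with support below sigma cannot reach
    # alpha / K(sigma), where K(sigma) counts itemsets with support >= sigma.
    counts = support_counts(db)  # enumeration runs once; thresholds reuse it
    n_total = len(db)
    best = (1, len(counts))
    for sigma in range(1, n_total + 1):
        k = sum(1 for s in counts.values() if s >= sigma)
        if k * min_p_value(sigma - 1, n_pos, n_total) > alpha:
            best = (sigma, k)
        else:
            break
    return best

# Hypothetical toy database: 8 transactions over items a, b, c, d.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "b"},
      {"a", "c"}, {"b"}, {"c"}, {"d"}]
print(correction_factor(db, n_pos=4))
```

On this toy input, with 4 of the 8 transactions assumed positive, the sketch selects a minimum support of 4 and a correction factor of 4, whereas a plain Bonferroni correction over all 15 non-empty itemsets of four items would use 15.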
References
Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proc. of the 1993 ACM SIGMOD International Conference on Management of Data. SIGMOD Record, vol. 22(2), pp. 207–216 (1993)
Arimura, H., Uno, T.: A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. In: Deng, X., Du, D.-Z. (eds.) ISAAC 2005. LNCS, vol. 3827, pp. 724–737. Springer, Heidelberg (2005)
Arimura, H., Uno, T., Shimozono, S.: Time and space efficient discovery of maximal geometric graphs. In: Corruble, V., Takeda, M., Suzuki, E. (eds.) DS 2007. LNCS (LNAI), vol. 4755, pp. 42–55. Springer, Heidelberg (2007)
Asai, T., Abe, K., Kawasoe, S., Sakamoto, H., Arimura, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. IEICE Trans. on Information and Systems E87-D(12), 2754–2763 (2004)
Asai, T., Arimura, H., Uno, T., Nakano, S.-i.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)
Boley, M., Gärtner, T., Grosskreutz, H., Fraunhofer, I.: Formal concept sampling for counting and threshold-free local pattern mining. In: Proc. of 2010 SIAM International Conference on Data Mining (SDM 2010), pp. 177–188 (April 2010)
Bonferroni, C.: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze (Libreria Internazionale Seeber, Florence, Italy), 8: Article 3–62 (1936)
Gallo, A., De Bie, T., Cristianini, N.: MINI: Mining informative non-redundant itemsets. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 438–445. Springer, Heidelberg (2007)
Goethals, B.: Survey on frequent pattern mining (2003), http://www.cs.helsinki.fi/u/goethals/publications/survey.ps
Goethals, B., Zaki, M.J.: Frequent itemset mining dataset repository. In: Frequent Itemset Mining Implementations, FIMI 2003 (2003), http://fimi.cs.helsinki.fi/
Nature Publishing Group. Nature guide to authors: Statistical checklist, http://www.nature.com/nature/authors/gta/Statistical_checklist.doc
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2000)
Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 8(1), 53–87 (2004)
Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg (2000)
Low-Kam, C., Raissi, C., Kaytoue, M., Pei, J.: Mining statistically significant sequential patterns. In: Proc. of 13th IEEE International Conference on Data Mining (ICDM 2013), pp. 488–497 (2013)
Okita, K., Ichisaka, T., Yamanaka, S.: Generation of germline-competent induced pluripotent stem cells. Nature 448, 313–317 (2007)
van der Laan, M.J., Dudoit, S.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
Tatti, N.: Maximum entropy based significance of itemsets. Knowledge and Information Systems 17(1), 57–77 (2008)
Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: LAMP limitless-arity multiple-testing procedure (2013), http://a-terada.github.io/lamp/
Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences of the United States of America 110(32), 12996–13001 (2013)
Uno, T.: Program codes of Takeaki Uno, http://research.nii.ac.jp/~uno/codes.htm
Uno, T., Kiyomi, M., Arimura, H.: LCM ver.2: Efficient mining algorithms for frequent/closed/maximal itemsets. In: Proc. IEEE ICDM 2004 Workshop FIMI 2004 (International Conference on Data Mining, Frequent Itemset Mining Implementations) (2004)
Uno, T., Kiyomi, M., Arimura, H.: LCM ver.3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proc. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations 2005 (2005)
Uno, T., Uchida, Y., Asai, T., Arimura, H.: LCM: An efficient algorithm for enumerating frequent closed item sets. In: Proc. Workshop on Frequent Itemset Mining Implementations, FIMI 2003 (2003), http://fimi.cs.helsinki.fi/src/
Wang, J., Han, J.: Bide: Efficient mining of frequent closed sequences. In: Proc. of 4th IEEE International Conference on Data Mining (ICDM 2004), pp. 79–90 (2007)
Webb, G.I.: Discovering significant patterns. Machine Learning 68(1), 1–33 (2007)
Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(2), 372–390 (2000)
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Minato, Si., Uno, T., Tsuda, K., Terada, A., Sese, J. (2014). A Fast Method of Statistical Assessment for Combinatorial Hypotheses Based on Frequent Itemset Enumeration. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol 8725. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44851-9_27
DOI: https://doi.org/10.1007/978-3-662-44851-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44850-2
Online ISBN: 978-3-662-44851-9
eBook Packages: Computer Science, Computer Science (R0)