A Fast Method of Statistical Assessment for Combinatorial Hypotheses Based on Frequent Itemset Enumeration

  • Shin-ichi Minato
  • Takeaki Uno
  • Koji Tsuda
  • Aika Terada
  • Jun Sese
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8725)

Abstract

In many scientific communities that work with experimental databases, a crucial problem is how to assess the statistical significance (p-value) of a discovered hypothesis. Assessing combinatorial hypotheses is especially hard because it requires a multiple-testing procedure with a very large p-value correction factor. Recently, Terada et al. proposed a novel p-value correction method, called the “Limitless Arity Multiple-testing Procedure” (LAMP), which uses frequent itemset enumeration to exclude meaninglessly infrequent itemsets that can never be significant. LAMP yields a much more accurate p-value correction than previous methods, and it empowers scientific discovery. However, the original LAMP implementation is sometimes too time-consuming for practical databases. We propose a new LAMP algorithm that executes an itemset mining algorithm essentially once, whereas the previous implementation executes it many times. Our experimental results show that the proposed method is much (10 to 100 times) faster than the original LAMP. This algorithm enables us to discover patterns with significant p-values in quite a short time, even for very large-scale databases.
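The key observation behind LAMP, as described in the abstract, is that a pattern occurring in only a few transactions has a bounded minimum achievable p-value, so sufficiently infrequent itemsets can be excluded from the multiple-testing correction. The sketch below illustrates this idea for a one-sided Fisher's exact test; it is a simplified illustration under stated assumptions, not the authors' implementation (in particular, the linear scan over support thresholds and the `count_itemsets` callback, which stands in for a frequent itemset miner such as LCM, are simplifications of the original search procedure).

```python
from math import comb

def min_pvalue(support, n, n_pos):
    """Minimum achievable one-sided Fisher's exact test p-value for a
    pattern occurring in `support` of `n` transactions, of which
    `n_pos` belong to the positive class.

    The best case is that as many occurrences as possible fall in the
    positive class, giving a single hypergeometric term.
    """
    k = min(support, n_pos)
    return comb(n_pos, k) * comb(n - n_pos, support - k) / comb(n, support)

def lamp_threshold(n, n_pos, alpha, count_itemsets):
    """Find the smallest support threshold lam at which even the best
    possible p-value, multiplied by the number of itemsets with support
    >= lam (the Bonferroni-style correction factor), can fall below
    alpha. Itemsets below this threshold can never be significant and
    need not be counted in the correction.

    `count_itemsets(lam)` is assumed to return the number of itemsets
    with support >= lam, e.g. obtained by frequent itemset mining.
    """
    lam = 1
    while lam <= n and min_pvalue(lam, n, n_pos) * count_itemsets(lam) > alpha:
        lam += 1
    return lam
```

For example, with 10 transactions, 5 positives, and a hypothetical flat count of 10 testable itemsets at every threshold, `lamp_threshold(10, 5, 0.05, lambda lam: 10)` returns 5: only patterns with support at least 5 could ever reach significance after correction.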


References

  1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proc. of the 1993 ACM SIGMOD International Conference on Management of Data. SIGMOD Record, vol. 22(2), pp. 207–216 (1993)
  2. Arimura, H., Uno, T.: A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. In: Deng, X., Du, D.-Z. (eds.) ISAAC 2005. LNCS, vol. 3827, pp. 724–737. Springer, Heidelberg (2005)
  3. Arimura, H., Uno, T., Shimozono, S.: Time and space efficient discovery of maximal geometric graphs. In: Corruble, V., Takeda, M., Suzuki, E. (eds.) DS 2007. LNCS (LNAI), vol. 4755, pp. 42–55. Springer, Heidelberg (2007)
  4. Asai, T., Abe, K., Kawasoe, S., Sakamoto, H., Arimura, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. IEICE Trans. on Information and Systems E87-D(12), 2754–2763 (2004)
  5. Asai, T., Arimura, H., Uno, T., Nakano, S.-i.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)
  6. Boley, M., Gärtner, T., Grosskreutz, H.: Formal concept sampling for counting and threshold-free local pattern mining. In: Proc. of 2010 SIAM International Conference on Data Mining (SDM 2010), pp. 177–188 (April 2010)
  7. Bonferroni, C.: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze (Libreria Internazionale Seeber, Florence, Italy), 8: Article 3–62 (1936)
  8. Gallo, A., De Bie, T., Cristianini, N.: MINI: Mining informative non-redundant itemsets. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 438–445. Springer, Heidelberg (2007)
  9. Goethals, B.: Survey on frequent pattern mining (2003), http://www.cs.helsinki.fi/u/goethals/publications/survey.ps
  10. Goethals, B., Zaki, M.J.: Frequent itemset mining dataset repository. In: Frequent Itemset Mining Implementations, FIMI 2003 (2003), http://fimi.cs.helsinki.fi/
  11. Nature Publishing Group: Nature guide to authors: Statistical checklist, http://www.nature.com/nature/authors/gta/Statistical_checklist.doc
  12. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2000)
  13. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 8(1), 53–87 (2004)
  14. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg (2000)
  15. Low-Kam, C., Raissi, C., Kaytoue, M., Pei, J.: Mining statistically significant sequential patterns. In: Proc. of 13th IEEE International Conference on Data Mining (ICDM 2013), pp. 488–497 (2013)
  16. Okita, K., Ichisaka, T., Yamanaka, S.: Generation of germline-competent induced pluripotent stem cells. Nature 448, 313–317 (2007)
  17. van der Laan, M.J., Dudoit, S.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
  18. Tatti, N.: Maximum entropy based significance of itemsets. Knowledge and Information Systems 17(1), 57–77 (2008)
  19. Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: LAMP: limitless-arity multiple-testing procedure (2013), http://a-terada.github.io/lamp/
  20. Terada, A., Okada-Hatakeyama, M., Tsuda, K., Sese, J.: Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences of the United States of America 110(32), 12996–13001 (2013)
  21. Uno, T.: Program codes of Takeaki Uno, http://research.nii.ac.jp/~uno/codes.htm
  22. Uno, T., Kiyomi, M., Arimura, H.: LCM ver.2: Efficient mining algorithms for frequent/closed/maximal itemsets. In: Proc. IEEE ICDM 2004 Workshop FIMI 2004 (International Conference on Data Mining, Frequent Itemset Mining Implementations) (2004)
  23. Uno, T., Kiyomi, M., Arimura, H.: LCM ver.3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proc. Open Source Data Mining Workshop on Frequent Pattern Mining Implementations 2005 (2005)
  24. Uno, T., Uchida, Y., Asai, T., Arimura, H.: LCM: An efficient algorithm for enumerating frequent closed item sets. In: Proc. Workshop on Frequent Itemset Mining Implementations, FIMI 2003 (2003), http://fimi.cs.helsinki.fi/src/
  25. Wang, J., Han, J.: BIDE: Efficient mining of frequent closed sequences. In: Proc. of 4th IEEE International Conference on Data Mining (ICDM 2004), pp. 79–90 (2004)
  26. Webb, G.I.: Discovering significant patterns. Machine Learning 68(1), 1–33 (2007)
  27. Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(2), 372–390 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Shin-ichi Minato (1, 2)
  • Takeaki Uno (3)
  • Koji Tsuda (4, 5, 2)
  • Aika Terada (6)
  • Jun Sese (6)

  1. Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
  2. JST ERATO Minato Discrete Structure Manipulation System Project, Sapporo, Japan
  3. National Institute of Informatics, Tokyo, Japan
  4. Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
  5. Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  6. Department of Computer Science, Ochanomizu University, Tokyo, Japan