A Fast and Simple Method for Mining Subsequences with Surprising Event Counts

Lijffijt, Jefrey

doi:10.1007/978-3-642-40988-2_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8188)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, omitting the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.
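
For orientation, the post-hoc correction baseline that the abstract compares against can be sketched as follows. This is not the method proposed in the paper: it is a minimal illustration that assumes events are independent Bernoulli trials with the global event frequency as the null probability, tests every fixed-length window with a one-sided binomial test, and applies a Bonferroni correction over all windows. The function names `binom_upper_tail` and `surprising_windows` are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def surprising_windows(seq, event, window, alpha=0.05):
    """Find windows with a surprisingly high count of `event`.

    Assumption (not from the paper): the null model is i.i.d. Bernoulli with
    the global frequency of `event`. Bonferroni treats all windows as
    separate tests and ignores that overlapping windows are dependent.
    Returns a list of (start, count, p_value) tuples.
    """
    n = len(seq)
    p = sum(1 for s in seq if s == event) / n  # global event frequency
    num_tests = n - window + 1                 # one test per window position
    hits = []
    count = sum(1 for s in seq[:window] if s == event)
    for start in range(num_tests):
        if start > 0:
            # Slide the window by one: add the entering item, drop the leaving one.
            count += (seq[start + window - 1] == event) - (seq[start - 1] == event)
        p_value = binom_upper_tail(count, window, p)
        if p_value * num_tests <= alpha:       # Bonferroni-corrected threshold
            hits.append((start, count, p_value))
    return hits


if __name__ == "__main__":
    # Toy sequence: a burst of ten "a" events inside a long run of "b".
    toy = "b" * 90 + "a" * 10 + "b" * 100
    for start, count, p_value in surprising_windows(toy, "a", window=10):
        print(start, count, p_value)
```

Because the correction ignores the overlap between neighbouring windows, it is conservative, which is the loss of power that the abstract attributes to approaches that do not model the dependency structure between patterns.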

Author information

Authors and Affiliations

Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Finland
Jefrey Lijffijt

Editor information

Editors and Affiliations

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Leuven, Belgium
Hendrik Blockeel
Fraunhofer IAIS, Department of Knowledge Discovery, University of Bonn, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Kristian Kersting
LIACS, Universiteit Leiden, Niels Bohrweg 1, 2333 CA, Leiden, The Netherlands
Siegfried Nijssen
Department of Computer Science and Engineering, Czech Technical University, Technicka 2, 16627, Prague 6, Czech Republic
Filip Železný

Cite this paper

Lijffijt, J. (2013). A Fast and Simple Method for Mining Subsequences with Surprising Event Counts. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40988-2_25

DOI: https://doi.org/10.1007/978-3-642-40988-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40987-5
Online ISBN: 978-3-642-40988-2
eBook Packages: Computer Science (R0)
