Data Mining and Knowledge Discovery

Volume 29, Issue 6, pp 1838–1864

Size matters: choosing the most informative set of window lengths for mining patterns in event sequences

  • Jefrey Lijffijt
  • Panagiotis Papapetrou
  • Kai Puolamäki

Abstract

In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.
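To make the role of the window length concrete, the following minimal sketch computes one real-valued output (the relative frequency of an event) in every sliding window, for several window lengths. It is an illustration only and not the authors' method: the function name `window_frequencies`, the binary encoding of the sequence, and the toy data are all assumptions introduced here.

```python
from typing import Dict, List, Sequence


def window_frequencies(events: Sequence[int], window: int) -> List[float]:
    """Relative frequency of 1-events in every full sliding window.

    `events` is a hypothetical binary encoding: 1 marks an occurrence of
    the event of interest, 0 marks any other position.
    """
    if window <= 0 or window > len(events):
        raise ValueError("window must be between 1 and len(events)")
    # Count the first window, then update the count incrementally.
    count = sum(events[:window])
    freqs = [count / window]
    for i in range(window, len(events)):
        count += events[i] - events[i - window]
        freqs.append(count / window)
    return freqs


# Toy sequence (assumed for illustration): bursts of two events, period 6.
sequence = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

# One output vector per candidate window length.
outputs: Dict[int, List[float]] = {
    w: window_frequencies(sequence, w) for w in (2, 4, 6)
}

for w, freqs in outputs.items():
    print(w, [round(f, 2) for f in freqs])
```

In this toy sequence, a window of length 2 exposes the bursts, whereas a window of length 6 matches the period and yields the same frequency everywhere. Different window lengths can therefore give qualitatively different views of the same data, which is the motivation for choosing them in a principled way.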

Keywords

Event sequence; Pattern mining; Window length; Output-space clustering; Exploratory data analysis

Notes

Acknowledgments

We thank Heikki Mannila for useful discussions and feedback. This work was supported by the Finnish Doctoral Programme in Computational Sciences (FICS), the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Finnish Centre of Excellence in Computational Inference Research (COIN). We acknowledge the computational resources provided by the Aalto Science-IT project.

Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Jefrey Lijffijt (1, 2)
  • Panagiotis Papapetrou (3)
  • Kai Puolamäki (4)

  1. Department of Engineering Mathematics, University of Bristol, Bristol, UK
  2. Department of Information and Computer Science, Aalto University, Espoo, Finland
  3. Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden
  4. Finnish Institute of Occupational Health, Helsinki, Finland
