
Compression-based data mining of sequential data

Abstract

The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm limits our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and lets the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data mining paradigm. The results are strongly connected to Kolmogorov complexity theory. As a practical matter, however, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We show that this approach is competitive with or superior to many state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, XML, and video datasets. As further evidence of the advantages of our method, we demonstrate its effectiveness in solving a real-world classification problem in recommending printing services and products.
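The abstract's claim that the approach can be built on any off-the-shelf compressor with only about a dozen lines of code can be illustrated with a minimal sketch. The snippet below uses the paper's compression-based dissimilarity measure, CDM(x, y) = C(xy) / (C(x) + C(y)), where C(·) denotes the compressed size of a string, implemented here with Python's standard zlib compressor; the example sequences are hypothetical.

```python
import zlib

def c(s: bytes) -> int:
    """Compressed size of s in bytes, using zlib at maximum effort."""
    return len(zlib.compress(s, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity measure.

    Values close to 1 suggest the sequences share little structure;
    values well below 1 suggest they compress well together, i.e.
    they are similar.
    """
    return c(x + y) / (c(x) + c(y))

# Hypothetical example: two highly similar sequences vs. an unrelated one.
x1 = b"abcabcabcabcabcabc" * 20
x2 = b"abcabcabcabcabcabc" * 20
y1 = b"the quick brown fox jumps over the lazy dog " * 10

print(cdm(x1, x2))  # small: identical sequences compress well together
print(cdm(x1, y1))  # larger: unrelated sequences share little structure
```

Because the measure only consults compressed sizes, swapping in a different compressor (bz2, LZMA, a domain-specific coder for DNA) changes only the `c` function, which is what makes the method effectively parameter-light.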



Author information


Corresponding author

Correspondence to Eamonn Keogh.

Additional information

Responsible editor: Johannes Gehrke



Cite this article

Keogh, E., Lonardi, S., Ratanamahatana, C.A. et al. Compression-based data mining of sequential data. Data Min Knowl Disc 14, 99–129 (2007). https://doi.org/10.1007/s10618-006-0049-3


Keywords

  • Kolmogorov complexity
  • Parameter-free data mining
  • Anomaly detection
  • Clustering