Finding Segmentations of Sequences

Chapter in the book Inductive Databases and Constraint-Based Data Mining

Abstract

We describe a collection of approaches to inductive querying systems for data that contain segmental structure. The main focus of this chapter is on work done in the Helsinki area in 2004–2008. Segmentation is a general data mining technique for summarizing and analyzing sequential data. We first introduce the basic problem setting and notation. We then briefly present an optimal way to accomplish the segmentation when there are no added constraints. The challenge, however, lies in adding constraints that relate the segments to each other and make the end result easier for a human to interpret, make the computational task simpler, or both. We describe various approaches to segmentation, ranging from efficient algorithms to added constraints and modifications of the problem. We also discuss topics beyond the basic task of segmentation, such as whether the output of a segmentation algorithm is meaningful, and touch upon some applications.
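As an illustration of the unconstrained case, the following is a minimal sketch of the classic dynamic programming approach to optimal segmentation. It is not taken from the chapter: the piecewise-constant model, the squared-error cost, and the name `optimal_segmentation` are all assumptions made for this example.

```python
from typing import List, Tuple


def optimal_segmentation(x: List[float], k: int) -> Tuple[float, List[int]]:
    """Split x into k contiguous segments minimizing total squared error.

    Assumption of this sketch: each segment is summarized by its mean
    (piecewise-constant model). Returns (error, starts), where starts
    holds the start index of each segment.
    """
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(x)")

    # Prefix sums of values and squares give any segment's error in O(1).
    s = [0.0] * (n + 1)
    ss = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        ss[i + 1] = ss[i] + v * v

    def seg_err(i: int, j: int) -> float:
        """Squared error of x[i:j] around its mean (requires j > i)."""
        total = s[j] - s[i]
        return max(0.0, (ss[j] - ss[i]) - total * total / (j - i))

    inf = float("inf")
    # err[m][j]: smallest error when x[:j] is covered by m segments.
    err = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[m][j]: start index of the last segment in that best solution.
    back = [[0] * (n + 1) for _ in range(k + 1)]
    err[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = err[m - 1][i] + seg_err(i, j)
                if c < err[m][j]:
                    err[m][j] = c
                    back[m][j] = i

    # Follow the back pointers from the end to recover segment starts.
    starts: List[int] = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return err[k][n], starts


if __name__ == "__main__":
    error, starts = optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)
    print(error, starts)
```

The sketch runs in O(n²k) time for a sequence of length n and k segments, which is why faster approximate algorithms and additional constraints become attractive for long sequences.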

Author information

Corresponding author

Correspondence to Ella Bingham.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Bingham, E. (2010). Finding Segmentations of Sequences. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_8

  • DOI: https://doi.org/10.1007/978-1-4419-7738-0_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-7737-3

  • Online ISBN: 978-1-4419-7738-0

  • eBook Packages: Computer Science, Computer Science (R0)