Finding Segmentations of Sequences

Bingham, Ella

doi:10.1007/978-1-4419-7738-0_8

Abstract

We describe a collection of approaches to inductive querying systems for data that contain segmental structure. The main focus in this chapter is on work done in the Helsinki area in 2004-2008. Segmentation is a general data mining technique for summarizing and analyzing sequential data. We first introduce the basic problem setting and notation. We then briefly present an optimal way to accomplish the segmentation, in the case of no added constraints. The challenge, however, lies in adding constraints that relate the segments to each other and make the end result more interpretable for the human eye, and/or make the computational task simpler. We describe various approaches to segmentation, ranging from efficient algorithms to added constraints and modifications to the problem. We also discuss topics beyond the basic task of segmentation, such as whether an output of a segmentation algorithm is meaningful or not, and touch upon some applications.
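
As a rough illustration of the unconstrained problem mentioned in the abstract, the sketch below splits a one-dimensional sequence into k contiguous segments with the standard dynamic program. The chapter text is not reproduced here, so this is not necessarily the formulation it presents. It assumes each segment is summarized by its mean and that the cost is the total squared error. The function name `k_segmentation` and its return format are hypothetical choices made for this example.

```python
from typing import List, Tuple


def k_segmentation(seq: List[float], k: int) -> Tuple[float, List[int]]:
    """Return (total squared error, segment start indices) of a best
    split of seq into k contiguous segments.

    Assumption for this sketch: piecewise-constant model, squared error.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(seq)")

    # Prefix sums of values and squares give any segment's error in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(seq):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i: int, j: int) -> float:
        # Squared error of seq[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: smallest error for the prefix seq[:j] using m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[m][j]: start index of the last segment in that best solution.
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Walk the back pointers from the end to recover the segment starts.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return cost[k][n], starts
```

For a sequence of length n, this sketch takes O(n²k) time and O(nk) space. Calling `k_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)` returns the segment starts `[0, 3, 6]` with zero error, since the input is exactly piecewise constant.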

Author information

Authors and Affiliations

Helsinki Institute for Information Technology, University of Helsinki and Aalto University School of Science and Technology, Helsinki, Finland
Ella Bingham

Corresponding author

Correspondence to Ella Bingham.

Editor information

Editors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, Ljubljana, 1000, Slovenia
Sašo Džeroski
Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, Antwerpen, B-2020, Belgium
Bart Goethals
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, SI-1000, Slovenia
Panče Panov

About this chapter

Cite this chapter

Bingham, E. (2010). Finding Segmentations of Sequences. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_8

DOI: https://doi.org/10.1007/978-1-4419-7738-0_8
Published: 18 November 2010
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-7737-3
Online ISBN: 978-1-4419-7738-0
eBook Packages: Computer Science, Computer Science (R0)
