Abstract
We describe a collection of approaches to inductive querying systems for data that contain segmental structure. The main focus of this chapter is on work done in the Helsinki area in 2004-2008. Segmentation is a general data mining technique for summarizing and analyzing sequential data. We first introduce the basic problem setting and notation. We then briefly present an optimal way to accomplish the segmentation when there are no added constraints. The challenge, however, lies in adding constraints that relate the segments to each other and that make the end result more interpretable to a human reader, make the computational task simpler, or both. We describe various approaches to segmentation, ranging from efficient algorithms to added constraints and modifications of the problem. We also discuss topics beyond the basic task of segmentation, such as whether the output of a segmentation algorithm is meaningful, and touch upon some applications.
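The unconstrained problem mentioned above is commonly solved with dynamic programming. The following is a minimal sketch of that idea, not the chapter's own formulation: it assumes a one-dimensional sequence, a piecewise-constant model (each segment is summarized by its mean), and squared error as the cost. The function name `segment` and its return format are illustrative choices.

```python
from typing import List, Sequence, Tuple


def segment(xs: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Optimal k-segmentation of xs under squared error (assumed cost).

    Returns the total error and the start index of each segment.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(xs)")

    # Prefix sums of values and squared values give O(1) segment costs.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        # Squared error of xs[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimum error of splitting the prefix xs[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[m][j]: start index of the last segment in that optimal split.
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers from the end to recover segment starts.
    starts: List[int] = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts
```

For a sequence of length n and k segments this sketch runs in O(n^2 k) time, which is the cost that motivates the faster approximate algorithms the abstract alludes to.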
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Bingham, E. (2010). Finding Segmentations of Sequences. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_8
DOI: https://doi.org/10.1007/978-1-4419-7738-0_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-7737-3
Online ISBN: 978-1-4419-7738-0
eBook Packages: Computer Science (R0)