Abstract
In this paper, an improved Gath–Geva clustering algorithm is proposed for automatic fuzzy segmentation of univariate and multivariate hydrometeorological time series. The algorithm treats the time series segmentation problem as Gath–Geva clustering, with the minimum message length criterion serving as the segmentation order selection criterion. The first characteristic of the improved algorithm is its unsupervised nature: it automatically determines the optimal segmentation order. The second is the use of a modified component-wise expectation maximization algorithm in Gath–Geva clustering, which avoids two drawbacks of the classical expectation maximization algorithm: sensitivity to initialization and the need to avoid the boundary of the parameter space. The third is improved numerical stability, obtained by integrating segmentation order selection into the model parameter estimation procedure. The proposed algorithm has been tested experimentally on artificial and hydrometeorological time series, and the results show its effectiveness.
References
Abonyi J, Feil B, Nemeth S, Arva P (2003) Fuzzy clustering based segmentation of time-series. In: Lecture notes in computer science, pp 275–286
Abonyi J, Feil B, Nemeth S, Arva P (2005) Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series. Fuzzy Sets Syst 149:39–56
Aksoy H, Unal NE, Gedikli A (2007) Letter to the editor. Stoch Environ Res Risk Assess 21:447–449
Aksoy H, Gedikli A, Unal NE, Kehagias A (2008) Fast segmentation algorithms for long hydrometeorological time series. Hydrol Process 22:4600–4608
Aksoy H, Unal NE, Pektas AO (2008) Smoothed minima baseflow separation tool for perennial and intermittent streams. Hydrol Process 22:4467–4476
Athanasiadis EI, Cavouras DA, Spyridonos PP, Glotsos DT, Kalatzis IK, Nikiforidis GC (2009) Complementary DNA microarray image processing based on the fuzzy gaussian mixture model. IEEE Trans Inf Technol Biomed 13(4):419–425
Beeferman D, Berger A, Lafferty J (1999) Statistical models for text segmentation. Mach Learn 34:177–210
Bezdek JC, Dunn JC (1975) Optimal fuzzy partitions: a heuristic for estimating the parameters in a mixture of normal distributions. IEEE Trans Comput 835–838
Celeux G, Chretien S, Forbes F, Mkhadri A (1999) A component-wise EM algorithm for mixtures. Technical report 3746, INRIA, France
Chatzis S, Varvarigou T (2008) Robust fuzzy clustering using mixtures of student’s-t distributions. Pattern Recognit Lett 29:1901–1905
Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Fisch D, Gruber T, Sick B (2011) Swiftrule: Mining comprehensible classification rules for time series analysis. IEEE Trans Knowl Data Eng 23(5):774–787
Fu Z, Robles-Kelly A, Zhou J (2010) Mixing linear SVMs for nonlinear classification. IEEE Trans Neural Netw 21:1963–1975
Fuchs E, Gruber T, Nitschke J, Sick B (2009) On-line motif detection in time series with swiftmotif. Pattern Recognit 42:3015–3031
Fuchs E, Gruber T, Nitschke J, Sick B (2010) Online segmentation of time series based on polynomial least-squares approximations. IEEE Trans Pattern Anal Mach Intell 32(12):2232–2245
Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 7:773–780
Gedikli A, Aksoy H, Unal NE (2008) Segmentation algorithm for long time series analysis. Stoch Environ Res Risk Assess 22(3):291–302
Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stoch Environ Res Risk Assess 24:547–557
Hanlon B, Forbes C (2002) Model selection criteria for segmented time series from a bayesian approach to information compression. Working paper, Department of Econometrics and Statistics, Monash University, Melbourne, Australia
Hubert P (2000) The segmentation procedure as a tool for discrete modeling of hydrometeorological regimes. Stoch Environ Res Risk Assess 14:297–304
Kehagias A (2004) A hidden markov model segmentation procedure for hydrological and environmental time series. Stoch Environ Res Risk Assess 18:117–130
Kehagias A, Fortin V (2006) Time series segmentation with shifting means hidden markov models. Nonlinear Process Geophys 13:339–352
Kehagias A, Nidelkou E, Petridis V (2005) A dynamic programming segmentation procedure for hydrological and environmental time series. Stoch Environ Res Risk Assess 20:77–94
Kehagias A, Petridis V, Nidelkou E (2007) Reply by the authors to the letter by Aksoy et al. Stoch Environ Res Risk Assess 21:451–455
Keogh E, Kasetty S (2003) On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining Knowl Discov 7(4):349–371
Lanterman AD (2001) Schwarz, Wallace, and Rissanen intertwining themes in theories of model order estimation. Int Stat Rev 69(2):185–212
Liu X, Lin Z, Wang H (2008) Novel online methods for time series segmentation. IEEE Trans Knowl Data Eng 20:1616–1626
Nascimento JC, Figueiredo M, Marques JS (2010) Trajectory classification using switched dynamical hidden Markov models. IEEE Trans Image Process 19(5):1338–1348
Povinelli R, Johnson M, Lindgren A, Ye J (2004) Time series classification using Gaussian mixture models of reconstructed phase spaces. IEEE Trans Knowl Data Eng 16(6):779–783
Seghouane A, Amari S (2007) The AIC criterion and symmetrizing the Kullback-Leibler divergence. IEEE Trans Neural Netw pp 97–106
Vernieuwe H, De Baets B, Verhoest NEC (2006) Comparison of clustering algorithms in the identification of Takagi-Sugeno models: A hydrological case study. Fuzzy Sets Syst 157:2876–2896
Warren Liao T (2005) Clustering of time series data-a survey. Pattern Recognit 38:1857–1874
Acknowledgements
The authors sincerely thank Professor Victor Leiva (Associate Editor), Professor George Christakos (Editor), and the anonymous referees for their kind advice and comments. Their suggestions have led to a major improvement of the paper. This work is supported by the National Natural Science Foundation of China under Grant No. 61175041 and the Fundamental Research Funds for the Central Universities (No. 2011QN147).
Appendices
Appendix 1: Gath–Geva clustering algorithm
Inputs: data set \({\cal{X}}=\{{\bf x}_{k}| 1\leq k\leq n\}, \) number of clusters \(1<c<n, \) weighting exponent \(m>1, \) termination tolerance \(\varepsilon>0, \) initial parameters \(\widehat{\varvec\theta}(0)=\{\widehat{\varvec\theta}_{1},\ldots,\widehat{\varvec\theta}_{c},\widehat{P}_{1},\ldots,\widehat{P}_{c}\}. \)
Output: optimal parameters \(\widehat{\varvec\theta}_{opt}. \)
Initialize: partition matrix \({\bf U}(0)=\big[\mu_{i,k}^{(0)}\big]_{c\times n}\) such that Eq. 3 holds.
Repeat for \(l=1,2,\ldots\)
Calculate parameters \(\widehat{\varvec\theta}(l). \)
Compute the distance measurement \(D({\bf x}_{k},{\bf v}_{i})^{2}. \)
Update the partition matrix \({\bf U}(l)=\big[\mu_{i,k}^{(l)}\big]_{c\times n}. \)
until \(\|{\bf U}(l)-{\bf U}(l-1)\|< \varepsilon. \)
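The loop above can be sketched in Python with NumPy. This is a minimal illustration only, not the improved algorithm of the paper: the update formulas are the commonly stated Gath–Geva ones (fuzzy-weighted centers and covariances, priors as mean memberships, and a distance inversely proportional to the Gaussian likelihood), and Eq. 3 is assumed to be the usual constraint that the memberships of each sample sum to one. The function name `gath_geva`, the `max_iter` guard, and the small ridge added to each covariance are assumptions introduced for this sketch.

```python
import numpy as np

def gath_geva(X, U0, m=2.0, eps=1e-6, max_iter=200):
    """Sketch of Gath-Geva fuzzy clustering.

    X   : (n, d) data array.
    U0  : (c, n) initial partition matrix, each column summing to 1.
    m   : weighting exponent (> 1).
    eps : termination tolerance on the change of the partition matrix.
    Returns (U, V, F, P): memberships (c, n), centers (c, d),
    fuzzy covariances (c, d, d) and prior probabilities (c,).
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U0, dtype=float)
    n, d = X.shape
    c = U.shape[0]
    for _ in range(max_iter):
        # Calculate parameters from the current partition matrix.
        W = U ** m                          # fuzzified memberships
        wsum = W.sum(axis=1)                # shape (c,)
        V = (W @ X) / wsum[:, None]         # cluster centers
        P = U.sum(axis=1) / n               # prior probabilities
        F = np.empty((c, d, d))
        D2 = np.empty((c, n))
        for i in range(c):
            diff = X - V[i]
            F[i] = (W[i][:, None] * diff).T @ diff / wsum[i]
            F[i] += 1e-9 * np.eye(d)        # assumption: tiny ridge for invertibility
            maha = np.einsum("kj,jl,kl->k", diff, np.linalg.inv(F[i]), diff)
            # Squared distance, inversely proportional to the Gaussian
            # likelihood; the constant (2*pi)^(d/2) cancels in the update.
            D2[i] = np.sqrt(np.linalg.det(F[i])) / P[i] * np.exp(0.5 * maha)
        # Update: mu_ik = 1 / sum_j (D_ik / D_jk)^(2 / (m - 1)).
        inv = np.maximum(D2, 1e-300) ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.linalg.norm(U_new - U) < eps
        U = U_new
        if converged:
            break
    return U, V, F, P
```

Because the exponential distance grows very quickly, the result depends strongly on the initial partition; a rough partition from another method (for instance fuzzy c-means) is a common choice for `U0`.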
Appendix 2: Bottom-up segmentation method
Create initial fine approximation by segment boundaries \(0=t_{n_{0}}<t_{n_{1}}<\cdots<t_{n_{c}}=t_{n}. \)
Find the cost of merging for each pair of segments: \( {\it {mergecost}}(i) = cost(t_{{n}_{i}}+1, t_{{n}_{i+2}}) \)
while min(mergecost) < maxerror
Find the cheapest pair to merge: \(i = \arg\min_{i} (mergecost(i)).\)
Merge the two segments, update the boundary indices \( (t_{{n}_{i}}, t_{{n}_{i+1}}) \), and recalculate the merge costs:
\( {\it {mergecost}}(i) = cost(t_{{n}_{i}}+1, t_{{n}_{i+2}}) \),
\( {\it {mergecost}}(i-1) = cost(t_{{n}_{i-1}}+1, t_{{n}_{i+1}}) \).
end
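The bottom-up loop can be sketched in plain Python as follows. The sketch makes several assumptions that are not part of the procedure above: segments are stored as half-open index ranges rather than the inclusive \(t_{n_i}+1\) convention, the initial fine approximation uses segments of `init_len` samples, and the default cost `sse_cost` (sum of squared deviations from the segment mean) is a simple stand-in for univariate data. The names `bottom_up`, `sse_cost`, and `init_len` are hypothetical.

```python
def sse_cost(series, start, end):
    """Assumed stand-in cost: squared deviations from the mean of series[start:end]."""
    seg = series[start:end]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)

def bottom_up(series, max_error, init_len=2, cost=sse_cost):
    """Return boundaries [0, ..., len(series)] of half-open segments."""
    n = len(series)
    # Initial fine approximation: consecutive segments of init_len samples.
    bounds = list(range(0, n, init_len)) + [n]
    # merge_cost[i] is the cost of the segment obtained by merging
    # segment i with segment i + 1.
    merge_cost = [cost(series, bounds[i], bounds[i + 2])
                  for i in range(len(bounds) - 2)]
    while merge_cost and min(merge_cost) < max_error:
        i = merge_cost.index(min(merge_cost))   # cheapest pair to merge
        del bounds[i + 1]                       # merge segments i and i + 1
        del merge_cost[i]
        # Recalculate the costs of the two pairs that involve the merged segment.
        if i < len(merge_cost):
            merge_cost[i] = cost(series, bounds[i], bounds[i + 2])
        if i > 0:
            merge_cost[i - 1] = cost(series, bounds[i - 1], bounds[i + 1])
    return bounds
```

For a step signal such as `[0.0] * 6 + [10.0] * 6` with `max_error=1.0`, the merges stop at the step and the function returns `[0, 6, 12]`. Passing the cost as a function keeps the loop independent of the particular cost definition.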
Let the covariance matrix \({\bf F}_{i}^{x}\) be decomposed into the matrix \(\varvec\Uplambda_{i}\), which contains the eigenvalues of \({\bf F}_{i}^{x}\) on its diagonal in decreasing order, and the matrix \({\bf U}_{i}\), whose columns are the eigenvectors corresponding to these eigenvalues, i.e., \({\bf F}_{i}^{x}={\bf U}_{i}\varvec\Uplambda_{i}{\bf U}_{i}^{T}. \) The segmentation cost can be taken as the reconstruction error of this segment, which is built from \(Q_{i,k}={\bf x}_{k}^{T}({\bf I}-{\bf U}_{i,p}{\bf U}_{i,p}^{T}){\bf x}_{k}, \) where \({\bf U}_{i,p}\) contains the eigenvectors corresponding to the first \(p\) nonzero eigenvalues. The segmentation cost can also be taken as the Hotelling \(T^{2}\) measure of this segment, which is built from \(T_{i,k}^{2}={\bf y}_{i,k}^{T}{\bf y}_{i,k}, \) with \({\bf y}_{i,k}=\varvec\Uplambda_{i,p}^{-\frac{1}{2}}{\bf U}_{i,p}^{T}{\bf x}_{k}, \) where \(\varvec\Uplambda_{i,p}\) is the diagonal matrix of those \(p\) eigenvalues. The interested reader can find more details about the bottom-up method in Abonyi et al. (2005).
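The two per-sample quantities \(Q_{i,k}\) and \(T_{i,k}^{2}\) can be computed for a single segment with a short NumPy sketch. It assumes that \({\bf x}_{k}\) is centered on the segment mean before projection and that the covariance is normalized by the number of samples; neither detail is stated above. The function name `pca_segment_measures` is hypothetical, and how the per-sample values are aggregated into a segment cost is left to Abonyi et al. (2005).

```python
import numpy as np

def pca_segment_measures(X, p):
    """Per-sample Q and Hotelling T^2 for one segment.

    X : (n, d) samples of the segment.
    p : number of retained principal components (nonzero eigenvalues).
    Returns (Q, T2), each of shape (n,).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # assumption: mean-centered samples
    F = Xc.T @ Xc / len(Xc)                 # covariance matrix of the segment
    eigvals, eigvecs = np.linalg.eigh(F)    # eigh returns ascending eigenvalues
    lam_p = eigvals[::-1][:p]               # p largest eigenvalues
    U_p = eigvecs[:, ::-1][:, :p]           # corresponding eigenvectors (columns)
    proj = Xc @ U_p                         # row k holds U_p^T x_k
    # Q = x^T (I - U_p U_p^T) x = |x|^2 - |U_p^T x|^2 for orthonormal U_p.
    Q = np.sum(Xc ** 2, axis=1) - np.sum(proj ** 2, axis=1)
    # y = Lambda_p^(-1/2) U_p^T x and T^2 = y^T y.
    Y = proj / np.sqrt(lam_p)
    T2 = np.sum(Y ** 2, axis=1)
    return Q, T2
```

For samples that lie exactly on a line, one retained component reconstructs them perfectly, so `Q` is zero up to rounding while `T2` still measures how far each sample lies from the segment mean along that line.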
Wang, N., Liu, X. & Yin, J. Improved Gath–Geva clustering for fuzzy segmentation of hydrometeorological time series. Stoch Environ Res Risk Assess 26, 139–155 (2012). https://doi.org/10.1007/s00477-011-0542-0
DOI: https://doi.org/10.1007/s00477-011-0542-0