
Improved Gath–Geva clustering for fuzzy segmentation of hydrometeorological time series

Original Paper · Stochastic Environmental Research and Risk Assessment

Abstract

In this paper, an improved Gath–Geva clustering algorithm is proposed for the automatic fuzzy segmentation of univariate and multivariate hydrometeorological time series. The algorithm treats the time series segmentation problem as Gath–Geva clustering, with the minimum message length criterion serving as the segmentation order selection criterion. One characteristic of the improved algorithm is its unsupervised nature: the optimal segmentation order is determined automatically. Another is the use of a modified component-wise expectation-maximization algorithm within Gath–Geva clustering, which avoids two drawbacks of the classical expectation-maximization algorithm, namely its sensitivity to initialization and the need to avoid the boundary of the parameter space. A third is improved numerical stability, achieved by integrating segmentation order selection into the model parameter estimation procedure. The proposed algorithm has been tested experimentally on artificial and hydrometeorological time series, and the results demonstrate its effectiveness.


References

  • Abonyi J, Feil B, Nemeth S, Arva P (2003) Fuzzy clustering based segmentation of time-series. In: Lecture notes in computer science, pp 275–286

  • Abonyi J, Feil B, Nemeth S, Arva P (2005) Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series. Fuzzy Sets Syst 149:39–56

  • Aksoy H, Unal NE, Gedikli A (2007) Letter to the editor. Stoch Environ Res Risk Assess 21:447–449

  • Aksoy H, Gedikli A, Unal NE, Kehagias A (2008) Fast segmentation algorithms for long hydrometeorological time series. Hydrol Process 22:4600–4608

  • Aksoy H, Unal NE, Pektas AO (2008) Smoothed minima baseflow separation tool for perennial and intermittent streams. Hydrol Process 22:4467–4476

  • Athanasiadis EI, Cavouras DA, Spyridonos PP, Glotsos DT, Kalatzis IK, Nikiforidis GC (2009) Complementary DNA microarray image processing based on the fuzzy Gaussian mixture model. IEEE Trans Inf Technol Biomed 13(4):419–425

  • Beeferman D, Berger A, Lafferty J (1999) Statistical models for text segmentation. Mach Learn 34:177–210

  • Bezdek JC, Dunn JC (1975) Optimal fuzzy partitions: a heuristic for estimating the parameters in a mixture of normal distributions. IEEE Trans Comput 835–838

  • Celeux G, Chretien S, Forbes F, Mkhadri A (1999) A component-wise EM algorithm for mixtures. Technical report 3746, INRIA, France

  • Chatzis S, Varvarigou T (2008) Robust fuzzy clustering using mixtures of Student's-t distributions. Pattern Recognit Lett 29:1901–1905

  • Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396

  • Fisch D, Gruber T, Sick B (2011) SwiftRule: mining comprehensible classification rules for time series analysis. IEEE Trans Knowl Data Eng 23(5):774–787

  • Fu Z, Robles-Kelly A, Zhou J (2010) Mixing linear SVMs for nonlinear classification. IEEE Trans Neural Netw 21:1963–1975

  • Fuchs E, Gruber T, Nitschke J, Sick B (2009) On-line motif detection in time series with SwiftMotif. Pattern Recognit 42:3015–3031

  • Fuchs E, Gruber T, Nitschke J, Sick B (2010) Online segmentation of time series based on polynomial least-squares approximations. IEEE Trans Pattern Anal Mach Intell 32(12):2232–2245

  • Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 7:773–780

  • Gedikli A, Aksoy H, Unal NE (2008) Segmentation algorithm for long time series analysis. Stoch Environ Res Risk Assess 22(3):291–302

  • Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stoch Environ Res Risk Assess 24:547–557

  • Hanlon B, Forbes C (2002) Model selection criteria for segmented time series from a Bayesian approach to information compression. Working paper, Department of Econometrics and Statistics, Monash University, Melbourne, Australia

  • Hubert P (2000) The segmentation procedure as a tool for discrete modeling of hydrometeorological regimes. Stoch Environ Res Risk Assess 14:297–304

  • Kehagias A (2004) A hidden Markov model segmentation procedure for hydrological and environmental time series. Stoch Environ Res Risk Assess 18:117–130

  • Kehagias A, Fortin V (2006) Time series segmentation with shifting means hidden Markov models. Nonlinear Process Geophys 13:339–352

  • Kehagias A, Nidelkou E, Petridis V (2005) A dynamic programming segmentation procedure for hydrological and environmental time series. Stoch Environ Res Risk Assess 20:77–94

  • Kehagias A, Petridis V, Nidelkou E (2007) Reply by the authors to the letter by Aksoy et al. Stoch Environ Res Risk Assess 21:451–455

  • Keogh E, Kasetty S (2003) On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining Knowl Discov 7(4):349–371

  • Lanterman AD (2001) Schwarz, Wallace, and Rissanen: intertwining themes in theories of model order estimation. Int Stat Rev 69(2):185–212

  • Liu X, Lin Z, Wang H (2008) Novel online methods for time series segmentation. IEEE Trans Knowl Data Eng 20:1616–1626

  • Nascimento JC, Figueiredo M, Marques JS (2010) Trajectory classification using switched dynamical hidden Markov models. IEEE Trans Image Process 19(5):1338–1348

  • Povinelli R, Johnson M, Lindgren A, Ye J (2004) Time series classification using Gaussian mixture models of reconstructed phase spaces. IEEE Trans Knowl Data Eng 16(6):779–783

  • Seghouane A, Amari S (2007) The AIC criterion and symmetrizing the Kullback-Leibler divergence. IEEE Trans Neural Netw, pp 97–106

  • Vernieuwe H, De Baets B, Verhoest NEC (2006) Comparison of clustering algorithms in the identification of Takagi-Sugeno models: a hydrological case study. Fuzzy Sets Syst 157:2876–2896

  • Warren Liao T (2005) Clustering of time series data-a survey. Pattern Recognit 38:1857–1874


Acknowledgements

The authors sincerely thank Professor Victor Leiva (Associate Editor), Professor George Christakos (Editor), and the anonymous referees for their kind advice and comments; their suggestions have led to a major improvement of the paper. This work was supported by the National Natural Science Foundation of China under Grant No. 61175041 and by the Fundamental Research Funds for the Central Universities under Grant No. 2011QN147.

Author information

Correspondence to Nini Wang.

Appendices

Appendix 1: Gath–Geva clustering algorithm

Inputs: data set $\mathcal{X}=\{\mathbf{x}_{k}\mid 1\leq k\leq n\}$, number of clusters $1<c<n$, weighting exponent $m>1$, termination tolerance $\varepsilon>0$, and initial parameters $\widehat{\boldsymbol\theta}(0)=\{\widehat{\boldsymbol\theta}_{1},\ldots,\widehat{\boldsymbol\theta}_{c},\widehat{P}_{1},\ldots,\widehat{P}_{c}\}$.

Output: optimal parameters $\widehat{\boldsymbol\theta}_{\mathrm{opt}}$.

Initialize: partition matrix $\mathbf{U}(0)=\big[\mu_{i,k}^{(0)}\big]_{c\times n}$ such that Eq. 3 holds.

Repeat for $l=1,2,\ldots$

Calculate the parameters $\widehat{\boldsymbol\theta}(l)$:

$$
\begin{aligned}
\mathbf{v}_{i}(l)&=\frac{\sum_{k=1}^{n}\big(\mu_{i,k}^{(l-1)}\big)^{m}\mathbf{x}_{k}}{\sum_{k=1}^{n}\big(\mu_{i,k}^{(l-1)}\big)^{m}},\\
\mathbf{F}_{i}(l)&=\frac{\sum_{k=1}^{n}\big(\mu_{i,k}^{(l-1)}\big)^{m}\big(\mathbf{x}_{k}-\mathbf{v}_{i}(l)\big)\big(\mathbf{x}_{k}-\mathbf{v}_{i}(l)\big)^{T}}{\sum_{k=1}^{n}\big(\mu_{i,k}^{(l-1)}\big)^{m}},\\
P_{i}(l)&=\frac{1}{n}\sum_{k=1}^{n}\mu_{i,k}^{(l-1)},\quad 1\leq i\leq c.
\end{aligned}
$$
(47)

Compute the distance measure $D(\mathbf{x}_{k},\mathbf{v}_{i})^{2}$:

$$
\begin{aligned}
D(\mathbf{x}_{k},\mathbf{v}_{i})^{2}&=\frac{1}{P_{i}(l)\,G\big(\mathbf{x}_{k};\mathbf{v}_{i}(l),\mathbf{F}_{i}(l)\big)}
=\frac{(2\pi)^{q/2}\sqrt{\det\big(\mathbf{F}_{i}(l)\big)}}{P_{i}(l)}\\
&\quad\cdot\exp\Big(\frac{1}{2}\big(\mathbf{x}_{k}-\mathbf{v}_{i}(l)\big)^{T}\big(\mathbf{F}_{i}(l)\big)^{-1}\big(\mathbf{x}_{k}-\mathbf{v}_{i}(l)\big)\Big).
\end{aligned}
$$
(48)

Update the partition matrix $\mathbf{U}(l)=\big[\mu_{i,k}^{(l)}\big]_{c\times n}$:

$$
\mu_{i,k}^{(l)}=\frac{1}{\sum_{j=1}^{c}\big(D(\mathbf{x}_{k},\mathbf{v}_{i})/D(\mathbf{x}_{k},\mathbf{v}_{j})\big)^{2/(m-1)}},\quad 1\leq i\leq c,\ 1\leq k\leq n.
$$
(49)

until $\|\mathbf{U}(l)-\mathbf{U}(l-1)\|<\varepsilon$.
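
For readers who want to experiment with the update equations (47)–(49), the following Python sketch implements them with NumPy. It is a minimal illustration of the appendix, not the authors' implementation; the random initialization of the partition matrix, the small diagonal regularization of the fuzzy covariances, and the default values of m, the tolerance, and the iteration cap are assumptions made here for the sketch.

```python
import numpy as np

def gath_geva(X, c, m=2.0, eps=1e-6, max_iter=200, seed=0):
    """Minimal Gath-Geva clustering following Eqs. (47)-(49) of Appendix 1.

    X : (n, q) data matrix; c : number of clusters; m : fuzzifier (> 1).
    Returns the final partition matrix U (c x n) and the cluster centers v.
    """
    n, q = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                               # columns sum to one (Eq. 3)

    for _ in range(max_iter):
        U_old = U.copy()
        W = U ** m                                   # fuzzified memberships
        v = (W @ X) / W.sum(axis=1, keepdims=True)   # cluster centers, Eq. (47)
        D2 = np.empty((c, n))
        for i in range(c):
            R = X - v[i]                             # residuals, shape (n, q)
            F = (W[i, :, None] * R).T @ R / W[i].sum()   # fuzzy covariance, Eq. (47)
            F += 1e-9 * np.eye(q)                    # regularization (assumption)
            P = U[i].mean()                          # prior probability, Eq. (47)
            maha = np.einsum('nj,jk,nk->n', R, np.linalg.inv(F), R)
            # Eq. (48): D^2 = (2*pi)^(q/2) * sqrt(det F) / P * exp(maha / 2)
            D2[i] = ((2 * np.pi) ** (q / 2) * np.sqrt(np.linalg.det(F)) / P
                     * np.exp(0.5 * maha))
        # Eq. (49): membership update from the distance ratios
        ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1))
        U = 1.0 / ratio.sum(axis=1)
        if np.linalg.norm(U - U_old) < eps:          # termination test
            break
    return U, v
```

For time-series segmentation, time is typically appended to each sample as an extra coordinate so that the fuzzy clusters correspond to contiguous segments, as in the modified Gath–Geva approach of Abonyi et al. (2005).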

Appendix 2: Bottom-up segmentation method

Create an initial fine approximation with segment boundaries $0=t_{n_{0}}<t_{n_{1}}<\cdots<t_{n_{c}}=t_{n}$.

Find the cost of merging for each pair of segments: $\mathit{mergecost}(i)=cost(t_{n_{i}}+1,\,t_{n_{i+2}})$.

while $\min(\mathit{mergecost})<\mathit{maxerror}$

Find the cheapest pair to merge: $i=\arg\min_{i}\,\mathit{mergecost}(i)$.

Merge the two segments, update the $(t_{n_{i}},t_{n_{i+1}})$ boundary indices, and recalculate the merge costs:

$\mathit{mergecost}(i)=cost(t_{n_{i}}+1,\,t_{n_{i+2}})$,

$\mathit{mergecost}(i-1)=cost(t_{n_{i-1}}+1,\,t_{n_{i+1}})$.

end
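
The merge loop above can be written compactly in Python; the following skeleton is an illustrative sketch, not the authors' code. It assumes a user-supplied cost(a, b) function returning the cost of a candidate segment spanning points a through b (for example, one of the two costs defined below) and an initial segment length init_len chosen here purely for illustration.

```python
def bottom_up(n_points, cost, max_error, init_len=2):
    """Generic bottom-up segmentation skeleton (Appendix 2).

    cost(a, b) : cost of a candidate segment covering points a..b (inclusive).
    Returns the list of segment boundaries [t_{n_0}, t_{n_1}, ..., t_{n_c}].
    """
    # Initial fine approximation: contiguous segments of length init_len.
    bounds = list(range(0, n_points, init_len)) + [n_points]
    # Cost of merging each pair of adjacent segments.
    merge_cost = [cost(bounds[i] + 1, bounds[i + 2]) for i in range(len(bounds) - 2)]

    while merge_cost and min(merge_cost) < max_error:
        i = merge_cost.index(min(merge_cost))    # cheapest pair to merge
        del bounds[i + 1]                        # merging removes the shared boundary
        del merge_cost[i]
        # Recalculate only the merge costs affected by the change.
        if i < len(bounds) - 2:
            merge_cost[i] = cost(bounds[i] + 1, bounds[i + 2])
        if i > 0:
            merge_cost[i - 1] = cost(bounds[i - 1] + 1, bounds[i + 1])
    return bounds
```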

Let the covariance matrix $\mathbf{F}_{i}^{x}$ be decomposed as $\mathbf{F}_{i}^{x}=\mathbf{U}_{i}\boldsymbol{\Lambda}_{i}\mathbf{U}_{i}^{T}$, where the matrix $\boldsymbol{\Lambda}_{i}$ contains the eigenvalues of $\mathbf{F}_{i}^{x}$ on its diagonal in decreasing order, and the columns of the matrix $\mathbf{U}_{i}$ are the corresponding eigenvectors. The segmentation cost can be taken as the reconstruction error of the segment,

$$
cost(t_{n_{i}}+1,\,t_{n_{i+1}})=\frac{1}{t_{n_{i+1}}-t_{n_{i}}+1}\sum_{k=t_{n_{i}}+1}^{t_{n_{i+1}}}Q_{i,k},
$$

where $Q_{i,k}=\mathbf{x}_{k}^{T}\big(\mathbf{I}-\mathbf{U}_{i,p}\mathbf{U}_{i,p}^{T}\big)\mathbf{x}_{k}$ and $\mathbf{U}_{i,p}$ contains the eigenvectors corresponding to the first $p$ nonzero eigenvalues. The segmentation cost can also be taken as the Hotelling $T^{2}$ measure of the segment,

$$
cost(t_{n_{i}}+1,\,t_{n_{i+1}})=\frac{1}{t_{n_{i+1}}-t_{n_{i}}+1}\sum_{k=t_{n_{i}}+1}^{t_{n_{i+1}}}T_{i,k}^{2},
$$

where $T_{i,k}^{2}=\mathbf{y}_{i,k}^{T}\mathbf{y}_{i,k}$ and $\mathbf{y}_{i,k}=\boldsymbol{\Lambda}_{i,p}^{-\frac{1}{2}}\mathbf{U}_{i,p}^{T}\mathbf{x}_{k}$. The interested reader can find more details about the bottom-up method in Abonyi et al. (2005).
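
Under the notation above, both costs follow from an eigendecomposition of the segment covariance matrix. The NumPy sketch below is one plausible rendering, not the authors' implementation; the use of np.cov for the segment covariance and of a plain mean over the per-point terms are assumptions, and the segment is assumed to be multivariate with its first p eigenvalues nonzero.

```python
import numpy as np

def segment_costs(X_seg, p):
    """Reconstruction-error and Hotelling T^2 costs for one segment (Appendix 2).

    X_seg : (length, q) matrix of the points in the segment.
    p     : number of leading eigenvectors retained.
    """
    q = X_seg.shape[1]
    F = np.cov(X_seg, rowvar=False)              # segment covariance F_i^x
    eigval, eigvec = np.linalg.eigh(F)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]             # re-sort into decreasing order
    lam_p = eigval[order][:p]                    # first p (assumed nonzero) eigenvalues
    U_p = eigvec[:, order][:, :p]                # corresponding eigenvectors U_{i,p}

    # Reconstruction error: Q_k = x_k^T (I - U_p U_p^T) x_k
    proj = np.eye(q) - U_p @ U_p.T
    Q = np.einsum('nj,jk,nk->n', X_seg, proj, X_seg)

    # Hotelling T^2: y_k = Lambda_p^(-1/2) U_p^T x_k,  T2_k = y_k^T y_k
    Y = (X_seg @ U_p) / np.sqrt(lam_p)
    T2 = np.sum(Y ** 2, axis=1)

    return Q.mean(), T2.mean()                   # per-segment averages
```

In the bottom-up skeleton above, cost(a, b) would simply slice the rows of the data matrix corresponding to points a through b and return one of these two averages.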

Cite this article

Wang, N., Liu, X. & Yin, J. Improved Gath–Geva clustering for fuzzy segmentation of hydrometeorological time series. Stoch Environ Res Risk Assess 26, 139–155 (2012). https://doi.org/10.1007/s00477-011-0542-0