Clustering Random Walk Time Series
Abstract
We present a novel non-parametric approach for clustering independent and identically distributed stochastic processes. We introduce a pre-processing step that maps multivariate independent and identically distributed samples from random variables to a generic non-parametric representation, which separates the dependency between variables from their marginal distributions without losing any information. We define an associated metric in which a single parameter controls the balance between dependency information and distribution information. This mixing parameter can be learned or tuned by a practitioner; we illustrate such use on the clustering of financial time series. Experiments, implementation, and results obtained on public financial time series are available online at the web portal http://www.datagrapple.com.
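To make the idea of a single mixing parameter concrete, here is a minimal sketch of a distance that blends a dependency term with a distribution term. It is an illustration under stated assumptions, not the metric defined in the paper: the abstract does not give the formula, so the rank-based squared distance, the histogram-based squared Hellinger distance, the bin count, and every function name (`empirical_ranks`, `dependence_distance_sq`, `distribution_distance_sq`, `mixed_distance`) are hypothetical choices made for this example.

```python
import math
import random


def empirical_ranks(x):
    """Map each observation to its normalized rank in (0, 1] (empirical CDF value)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    for position, index in enumerate(order):
        ranks[index] = (position + 1) / len(x)
    return ranks


def dependence_distance_sq(x, y):
    """Assumed dependency term: mean squared rank difference, scaled by 3 to lie in [0, 1]."""
    rx, ry = empirical_ranks(x), empirical_ranks(y)
    return 3.0 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / len(x)


def histogram(x, lo, hi, bins):
    """Relative frequencies of x over `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in x:
        k = min(int((value - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    return [c / len(x) for c in counts]


def distribution_distance_sq(x, y, bins=20):
    """Assumed distribution term: squared Hellinger distance between shared-bin histograms."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    if hi == lo:
        return 0.0  # both samples are the same constant
    p, q = histogram(x, lo, hi, bins), histogram(y, lo, hi, bins)
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def mixed_distance(x, y, theta, bins=20):
    """Blend the two terms: theta = 1 uses only dependency, theta = 0 only distribution."""
    d_sq = (theta * dependence_distance_sq(x, y)
            + (1.0 - theta) * distribution_distance_sq(x, y, bins))
    return math.sqrt(d_sq)


# Usage: two perfectly comonotone samples with different scales.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(500)]
y = [3.0 * v for v in x]  # same ordering as x, wider marginal distribution

for theta in (0.0, 0.5, 1.0):
    print(theta, mixed_distance(x, y, theta))
```

In this sketch the two samples are indistinguishable when `theta = 1` (their ranks coincide) and clearly separated when `theta = 0` (their marginal distributions differ), which is the kind of trade-off the mixing parameter is meant to expose to the practitioner.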