SOHAC: Efficient Storage of Tick Data That Supports Search and Analysis
Storing tick data is challenging because two criteria must be fulfilled simultaneously: the storage structure should allow fast execution of queries, and the data should not occupy too much space on the hard disk or in main memory. In this paper, we present a clustering-based solution and introduce a new clustering algorithm, SOHAC, designed to support the storage of tick data. We evaluate our algorithm on publicly available real-world datasets as well as on real-world tick data from the financial domain provided by one of the world's most renowned investment banks. In our experiments, we compare SOHAC against a large collection of conventional hierarchical clustering algorithms from the literature. The experiments show that, for the tick data storage problem, our algorithm substantially outperforms the examined clustering algorithms, both in terms of statistical significance and practical relevance.
Keywords: Cluster Algorithm, Stock Market, Compression Ratio, Single Linkage, Cosine Similarity
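For readers unfamiliar with the conventional baselines mentioned in the abstract, the following is a minimal sketch of one such baseline: agglomerative hierarchical clustering with single linkage over cosine distance. It is not SOHAC, whose details are not given here. The function names, the toy `series` data, and the convention of treating zero vectors as maximally distant are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def single_linkage(vectors, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    The distance between two clusters is the minimum pairwise
    distance between their members (single linkage).
    """
    target = max(1, num_clusters)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > target:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    cosine_distance(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: rows 0 and 1 rise, rows 2 and 3 fall.
series = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 2.0, 1.0],
    [6.0, 4.1, 2.0],
]
clusters = single_linkage(series, 2)
```

On this toy input the two rising rows and the two falling rows end up in separate clusters. In a storage setting, grouping similar series is what could allow members of a cluster to be encoded relative to a shared representative, although how SOHAC exploits its clusters is not described in this abstract.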