Generating data series query workloads

  • Kostas Zoumpatianos
  • Yin Lou
  • Ioana Ileana
  • Themis Palpanas
  • Johannes Gehrke
Regular Paper

Abstract

Data series (including time series) have attracted considerable interest in recent years. Most research has focused on efficiently supporting similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed to solve this problem. Until now, very little attention has been paid to properly evaluating such index structures, with most previous works relying solely on randomly selected data series as queries. In this work, we show that random workloads are inherently unsuitable for the task at hand, and we argue that query workloads need to be generated carefully. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that effectively evaluate and compare data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. This is the first paper that introduces a method for quantifying the hardness of data series queries, together with the ability to generate queries of predefined hardness.
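
To make the terms in the abstract concrete, the following is a minimal sketch of a brute-force nearest neighbor query over a small data series collection, together with one possible way to score how hard a query is. The abstract does not spell out the paper's measures, so `hardness_proxy`, its `epsilon` parameter, and the random-walk data are illustrative assumptions and should not be read as the authors' definitions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length data series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, collection):
    """Brute-force 1-NN search: return (index, distance) of the closest series."""
    best_idx, best_dist = -1, math.inf
    for i, series in enumerate(collection):
        d = euclidean(query, series)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist


def hardness_proxy(query, collection, epsilon=0.1):
    """Hypothetical hardness score (an assumption, not the paper's measure):
    the fraction of series whose distance to the query is within
    (1 + epsilon) times the nearest neighbor distance. A larger fraction
    means more candidates are nearly as close as the true answer."""
    _, nn_dist = nearest_neighbor(query, collection)
    threshold = (1.0 + epsilon) * nn_dist
    close = sum(1 for s in collection if euclidean(query, s) <= threshold)
    return close / len(collection)


def random_walk(length, rng):
    """Synthetic data series: cumulative sum of Gaussian steps."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)
    return series


if __name__ == "__main__":
    rng = random.Random(0)
    collection = [random_walk(64, rng) for _ in range(200)]
    # A "random workload": queries drawn the same way as the data.
    for _ in range(3):
        query = random_walk(64, rng)
        idx, dist = nearest_neighbor(query, collection)
        score = hardness_proxy(query, collection)
        print(f"nn index={idx} distance={dist:.2f} proxy={score:.3f}")
```

Under this proxy, a query whose nearest neighbor is well separated from the rest of the collection scores close to `1 / len(collection)`, while a query with many near-ties scores higher. A workload generator in this spirit would select or construct queries to hit a target score instead of sampling them at random.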

Keywords

Time series, Data series, Similarity search, Indexing, Query workload generation

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Harvard University, Cambridge, USA
  2. Airbnb Inc., San Francisco, USA
  3. LIPADE, Paris Descartes University, Paris, France
  4. Microsoft Inc., Redmond, USA