ADS: the adaptive data series index

Abstract

Numerous applications continuously produce large amounts of data series, and in several time-critical scenarios analysts need to be able to query these data as soon as they become available. This, however, is not currently possible with state-of-the-art indexing methods on very large data series collections. In this paper, we present the first adaptive indexing mechanism specifically tailored to the problem of indexing and querying very large data series collections. We present a detailed design and evaluation of our method using approximate and exact query algorithms with both synthetic and real data sets. Adaptive indexing significantly outperforms previous solutions, gracefully handling large data series collections and reducing the data-to-query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before they answer even a single query), our method has already answered \(3 \times 10^5\) queries.

Notes

  1. This paper is an extended version of [22]. It describes an exact search algorithm and a new full index construction method, both outperforming the state of the art. It also includes more detailed discussions and additional experiments.

References

  1. Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput. Intell. Mag. 9(3), 27–39 (2014)

  2. Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP (1999)

  3. Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015)

  4. Shasha, D.: Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)

  5. Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: KDD (2009)

  6. Bu, Y., Leung, T.W., Fu, A.W.C., Keogh, E., Pei, J., Meshkin, S.: WAT: finding top-k discords in time series database. In: SDM (2007)

  7. Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: return to the basics. PVLDB 5(11), 1662–1673 (2012)

  8. Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. PVLDB 8(1), 13–24 (2014)

  9. Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD (2012)

  10. Rodrigues, P., Gama, J., Pedroso, J.: Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Eng. 20(5), 615–627 (2008)

  11. Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)

  12. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: indexing and mining one billion time series. In: ICDM (2010)

  13. QualiMaster: a configurable real-time data processing infrastructure mastering autonomous quality adaptation - deliverable D1.1: initial use cases and requirements. Technical report, QualiMaster Project (2014)

  14. Rogers, S.: Big data is scaling BI and analytics. Information Management. http://www.information-management.com/issues/21_5/big-data-is-scaling-bi-and-analytics-10021093-1.html (2011). Accessed 28 Aug 2016

  15. ADHD-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2011)

  16. Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2015)

  17. Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results? In: CIDR (2011)

  18. Idreos, S., Liarou, E.: dbtouch: analytics at your fingertips. In: CIDR (2013)

  19. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD (1984)

  20. Berchtold, S., Keim, D.A., Kriegel, H.P.: The X-tree: an index structure for high-dimensional data. In: VLDB (1996)

  21. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  22. Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD (2014)

  23. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO Conference (1993)

  24. Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD (1998)

  25. Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDE (2011)

  26. Liao, T.W.: Clustering of time series data - a survey. Pattern Recognit. 38(11), 1857–1874 (2005)

  27. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009)

  28. Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.J.: Experimental comparison of representation methods and distance measures for time series data. DMKD 26(2), 275–309 (2013)

  29. Chen, L., Özsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. In: SIGMOD (2005)

  30. Vlachos, M., Gunopulos, D., Kollios, G.: Discovering similar multidimensional trajectories. In: ICDE (2002)

  31. Palpanas, T., Vlachos, M., Keogh, E.J., Gunopulos, D.: Streaming time series summarization using user-defined amnesic functions. IEEE Trans. Knowl. Data Eng. 20(7), 992–1006 (2008)

  32. Palpanas, T., Vlachos, M., Keogh, E.J., Gunopulos, D., Truppel, W.: Online amnesic approximation of streaming time series. In: ICDE, pp. 339–349 (2004)

  33. Chan, K.P., Fu, A.C.: Efficient time series matching by wavelets. In: ICDE (1999)

  34. Keogh, E., Chakrabarti, K., Pazzani, M.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2000)

  35. Yi, B., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB (2000)

  36. Lin, J., Keogh, E., Lonardi, S.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD, pp. 2–11 (2003)

  37. Assent, I., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: efficient time series search and retrieval. In: EDBT (2008)

  38. Shieh, J., Keogh, E.: iSAX: indexing and mining terabyte sized time series. In: KDD (2008)

  39. Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19(1), 24–57 (2009)

  40. Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)

  41. Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S., Seeger, B.: Transactional support for adaptive indexing. VLDB J. 23(2), 303–328 (2014)

  42. Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)

  43. Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD, pp. 413–424 (2007)

  44. Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR (2007)

  45. Idreos, S., Kersten, M.L., Manegold, S.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD (2009)

  46. Idreos, S., Manegold, S., Kuno, H.A., Graefe, G.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 585–597 (2011)

  47. Schuhknecht, F.M., Jindal, A., Dittrich, J.: The uncracked pieces in database cracking. PVLDB 7(2), 97–108 (2013)

  48. Richter, S., Quiane-Ruiz, J.-A., Schuh, S., Dittrich, J.: Towards zero-overhead static and adaptive indexing in Hadoop. VLDB J. 23(3), 469–494 (2013)

  49. Zhou, J., Ross, K.A.: Buffering accesses to memory-resident index structures. In: VLDB (2003)

  50. Zhou, J., Ross, K.A.: Buffering database operations for enhanced instruction cache performance. In: SIGMOD (2004)

  51. Stonebraker, M.: The case for partial indexes. SIGMOD Rec. 18(4), 4–11 (1989)

  52. Achakeev, D., Seeger, B.: Efficient bulk updates on multiversion B-trees. PVLDB 6(14), 1834–1845 (2013)

  53. Ghanem, T.M., Shah, R., Mokbel, M.F., Aref, W.G., Vitter, J.S.: Bulk operations for space-partitioning trees. In: ICDE (2004)

  54. Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: KDD (2015)

  55. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD (1994)

  56. Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. In: SIGMOD, pp. 13–25 (1997)

  57. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. TPAMI 33(1), 117–128 (2011)

  58. Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 123–151 (2014)

  59. Incorporated Research Institutions for Seismology - Seismic Data Access. http://ds.iris.edu/data/access/ (2016)

  60. Soldi, S., Beckmann, V., Baumgartner, W., Ponti, G., Shrader, C.R., Lubiński, P., Krimm, H., Mattana, F., Tueller, J.: Long-term variability of AGN at hard X-rays. Astron. Astrophys. 563, A57 (2014)

  61. Kashyap, S., Karras, P.: Scalable kNN search on vertically stored time series. In: KDD (2011)

  62. Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Rec. 44(2), 47–52 (2015)

  63. Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1923 (2015)

  64. du Mouza, C., Litwin, W., Rigaux, P.: SD-Rtree: a scalable distributed Rtree. In: ICDE (2007)

  65. Wang, J., Wu, S., Gao, H., Li, J., Ooi, B.C.: Indexing multi-dimensional data in a cloud system. In: SIGMOD (2010)

  66. Xie, Y., Palsetia, D., Trajcevski, G., Agrawal, A., Choudhary, A.N.: SILVERBACK: scalable association mining for temporal data in columnar probabilistic databases. In: ICDE (2014)

Acknowledgments

We would like to thank Prof. Volker Beckmann for providing us with the Astro data set [60].

Author information

Corresponding author

Correspondence to Kostas Zoumpatianos.

About this article

Cite this article

Zoumpatianos, K., Idreos, S. & Palpanas, T. ADS: the adaptive data series index. The VLDB Journal 25, 843–866 (2016). https://doi.org/10.1007/s00778-016-0442-5

Keywords

  • Data Series
  • Query Processing
  • Leaf Size
  • Indexing Cost
  • Query Answering