Distributed and Parallel Databases, Volume 18, Issue 2, pp 173–197

Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams

  • Jiawei Han
  • Yixin Chen
  • Guozhu Dong
  • Jian Pei
  • Benjamin W. Wah
  • Jianyong Wang
  • Y. Dora Cai

Abstract

Real-time surveillance systems, telecommunication systems, and other dynamic environments often generate tremendous (potentially infinite) volumes of stream data: the volume is too large to be scanned multiple times. Much of this data resides at a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes (such as trends and outliers). To discover such high-level characteristics, one may need to perform on-line, multi-level, multi-dimensional analytical processing of stream data. In this paper, we propose an architecture, called stream cube, to facilitate on-line, multi-dimensional, multi-level analysis of stream data.

For fast on-line, multi-dimensional analysis of stream data, three important techniques are proposed for efficient and effective computation of stream cubes. First, a tilted time frame model is proposed as a multi-resolution model to register time-related data: more recent data are registered at a finer resolution, whereas more distant data are registered at a coarser resolution. This design reduces the overall storage of time-related data and adapts well to the data analysis tasks commonly encountered in practice. Second, instead of materializing cuboids at all levels, we propose to maintain a small number of critical layers. Flexible analysis can be performed efficiently based on the concepts of an observation layer and a minimal interesting layer. Third, an efficient stream data cubing algorithm is developed that computes only the layers (cuboids) along a popular path and leaves the other cuboids for query-driven, on-line computation. Based on this design methodology, stream data cubes can be constructed and maintained incrementally with a reasonable amount of memory, computation cost, and query response time. This is verified by our substantial performance study.
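
The tilted time frame idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name `TiltedTimeFrame`, the granularities (quarter, hour, day), the slot counts, and the use of a simple sum as the measure are all assumptions chosen for illustration. The sketch only shows the general mechanism, in which recent values stay at fine resolution and are merged into coarser slots as they age.

```python
from collections import deque


class TiltedTimeFrame:
    """Hypothetical multi-resolution register: recent data fine, older data coarse."""

    def __init__(self, levels):
        # levels: list of (name, capacity) pairs, finest granularity first.
        # For every level except the coarsest, `capacity` slots are merged
        # into one slot of the next coarser level (assumed measure: sum).
        self.names = [name for name, _ in levels]
        self.capacities = [cap for _, cap in levels]
        self.slots = [deque() for _ in levels]

    def add(self, value):
        """Register the aggregate of one finest-granularity time unit."""
        self._push(0, value)

    def _push(self, level, value):
        slots = self.slots[level]
        slots.append(value)
        if level == len(self.slots) - 1:
            # Coarsest level: the oldest slot falls off the frame.
            if len(slots) > self.capacities[level]:
                slots.popleft()
        elif len(slots) == self.capacities[level]:
            # A full finer unit has accumulated: merge it and promote it.
            merged = sum(slots)
            slots.clear()
            self._push(level + 1, merged)

    def snapshot(self):
        """Return the current slots per level, oldest first within a level."""
        return {name: list(s) for name, s in zip(self.names, self.slots)}


# Assumed granularities: 4 quarters per hour, 24 hours per day, 31 days kept.
frame = TiltedTimeFrame([("quarter", 4), ("hour", 24), ("day", 31)])
for _ in range(10):  # ten quarter-hour readings, each with measure 1
    frame.add(1)
print(frame.snapshot())
```

After ten quarter-hour readings, the two completed hours are held as two hour-level slots and only the two most recent quarters remain at fine resolution, so the number of stored slots grows far more slowly than the number of readings.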

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Jiawei Han (1)
  • Yixin Chen (2)
  • Guozhu Dong (3)
  • Jian Pei (4)
  • Benjamin W. Wah (1)
  • Jianyong Wang (5)
  • Y. Dora Cai (1)

  1. University of Illinois, Illinois
  2. Washington University, St. Louis
  3. Wright State University, USA
  4. Simon Fraser University, B.C., Canada
  5. Tsinghua University, Beijing, China
