The VLDB Journal, Volume 13, Issue 4, pp 333–353

Operator scheduling in data stream systems

  • Brian Babcock (email author)
  • Shivnath Babu
  • Mayur Datar
  • Rajeev Motwani
  • Dilys Thomas
Article

Abstract.

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on runtime system memory usage as well as output latency. Our aim is to design a scheduling strategy that minimizes the maximum runtime system memory while maintaining the output latency within prespecified bounds. We first present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing runtime memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams and for multiple queries of the above types. However, during bursts in input streams, when there is a buildup of unprocessed tuples, Chain scheduling may lead to high output latency. We study the online problem of minimizing maximum runtime memory, subject to a constraint on maximum latency. We present preliminary observations, negative results, and heuristics for this problem. We provide a thorough experimental evaluation in which we demonstrate the potential benefits of Chain scheduling and its variants, compare it with competing scheduling strategies, and validate our analytical conclusions.
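The claim that the scheduling policy alone can change peak memory can be illustrated with a small simulation. This is a minimal sketch under assumed parameters, not the Chain algorithm itself: a hypothetical two-operator pipeline in which the first operator shrinks each tuple to one fifth of its size and the second consumes it, each operator takes one time unit per tuple, and one tuple arrives per time unit during a burst. The function name `simulate`, the policy labels, and all numbers are illustrative assumptions.

```python
from fractions import Fraction


def simulate(policy, n_steps, selectivity=Fraction(1, 5)):
    """Return the queued memory (in tuple units) seen at each time step.

    Assumed model: operator 1 shrinks a tuple to `selectivity` of its size,
    operator 2 removes it from the system, and each runs one tuple per step.
    """
    q1 = 0  # full-size tuples waiting for operator 1
    q2 = 0  # shrunken tuples waiting for operator 2
    memory = []
    for _ in range(n_steps):
        q1 += 1  # burst: one new tuple arrives every time unit
        memory.append(q1 + q2 * selectivity)
        if policy == "greedy":
            # Prefer the operator that frees the most memory per time unit.
            run_op1 = q1 > 0
        else:
            # "fifo": finish the oldest tuple before starting a newer one.
            run_op1 = q2 == 0
        if run_op1:
            q1 -= 1
            q2 += 1
        else:
            q2 -= 1
    return memory


if __name__ == "__main__":
    for policy in ("greedy", "fifo"):
        usage = simulate(policy, 5)
        print(policy, [float(m) for m in usage], "peak:", float(max(usage)))
```

Under these assumptions the greedy policy accumulates only shrunken tuples during the burst, while FIFO lets full-size tuples queue up, so its peak memory grows faster. The sketch ignores output latency, which is the other side of the trade-off the abstract describes.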

Keywords:

Data streams, Scheduling, Memory management, Latency

Copyright information

© Springer-Verlag Berlin/Heidelberg 2004

Authors and Affiliations

  • Brian Babcock (1), email author
  • Shivnath Babu (1)
  • Mayur Datar (1)
  • Rajeev Motwani (1)
  • Dilys Thomas (1)

  1. Department of Computer Science, Stanford University, Stanford, USA
