Scalable Splitting of Massive Data Streams

  • Erik Zeitler
  • Tore Risch
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5982)


Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams, over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application-specific stream splitting, we introduce splitstream functions, where the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. A cost model is introduced that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions and relate experimental results to cost model estimates. Based on the results, a splitstream function called autosplit is proposed, which scales well for high degrees of parallelism and is robust for varying proportions of stream partitioning and replication. We show how user-defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.
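To make the distinction between partitioning and replication concrete, here is a minimal single-process sketch in Python. It is not the paper's splitstream syntax or the autosplit implementation. The names `split_stream` and `route_tuple`, and the tuple fields `kind` and `key`, are hypothetical and chosen only for illustration. Each tuple is either routed to exactly one sub-stream (partitioning) or copied to all sub-streams (replication).

```python
from typing import Any, Callable, Dict, Iterable, List, Set

StreamTuple = Dict[str, Any]  # hypothetical tuple representation


def route_tuple(tup: StreamTuple, n: int) -> Set[int]:
    """Hypothetical routing rule: replicate control tuples, partition the rest."""
    if tup["kind"] == "control":
        # Replication: every sub-stream receives a copy.
        return set(range(n))
    # Partitioning: exactly one sub-stream, chosen by an integer key.
    return {tup["key"] % n}


def split_stream(
    stream: Iterable[StreamTuple],
    n: int,
    route: Callable[[StreamTuple, int], Set[int]],
) -> List[List[StreamTuple]]:
    """Split `stream` into `n` sub-streams according to `route`."""
    outputs: List[List[StreamTuple]] = [[] for _ in range(n)]
    for tup in stream:
        for i in route(tup, n):
            outputs[i].append(tup)
    return outputs


stream = [
    {"kind": "data", "key": 0},
    {"kind": "data", "key": 1},
    {"kind": "control", "key": -1},
    {"kind": "data", "key": 2},
]
sub_streams = split_stream(stream, 2, route_tuple)
```

In this sketch the splitter itself is a single sequential loop, which is exactly the kind of component the abstract identifies as a bottleneck for high-volume streams.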


Keywords: distributed stream systems, parallelization, query optimization





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Erik Zeitler (1)
  • Tore Risch (1)
  1. Department of Information Technology, Uppsala University, Sweden
