Skip to main content

Scalable Splitting of Massive Data Streams

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2010)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5982))

Included in the following conference series:

Abstract

Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application specific stream splitting, we introduce splitstream functions where the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. A cost model is introduced that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions, and relate experimental results to cost model estimates. Based on the results, a splitstream function called autosplit is proposed, which scales well for high degrees of parallelism, and is robust for varying proportions of stream partitioning and replication. We show how user defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Arasu, A., et al.: Linear Road: A Stream Data Management Benchmark. In: VLDB (2004)

    Google Scholar 

  2. Balazinska, M., Balakrishnan, H., Stonebraker, M.: Contract-Based Load Management in Federated Distributed Systems. In: NSDI (2004)

    Google Scholar 

  3. Chaiken, R., et al.: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In: VLDB (2008)

    Google Scholar 

  4. Cranor, C., Johnson, T., Spataschek, O., Shkapenyuk, V.: Gigascope: A Stream Database for Network Applications. In: SIGMOD (2003)

    Google Scholar 

  5. Das, S., Antony, S., Agrawal, D., El Abbadi, A.: Thread Cooperation in Multicore Architectures for Frequency Counting over Multiple Data Streams. In: VLDB (2009)

    Google Scholar 

  6. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI (2004)

    Google Scholar 

  7. Gidofalvi, G., Pedersen, T.B., Risch, T., Zeitler, E.: Highly scalable trip grouping for large-scale collective transportation systems. In: EDBT (2008)

    Google Scholar 

  8. Girod, L., Mei, Y., Newton, R., Rost, S., Thiagarajan, A., Balakrishnan, H., Madden, S.: XStream: A Signal-Oriented Data Stream Management System. In: ICDE (2008)

    Google Scholar 

  9. Risch, T., Josifovski, V., Katchaounov, T.: Functional Data Integration in a Distributed Mediator System. In: Gray, P.M.D., Kerschberg, L., King, P.J.H., Poulovassilis, A. (eds.) The Functional Approach to Data Management (2004)

    Google Scholar 

  10. Isard, M., et al.: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. ACM SIGOPS Operating Systems Review 41, 59–72 (2007)

    Article  Google Scholar 

  11. iStreams homepage, http://www.it.uu.se/resnearch/group/udbl/html/iStreams.html

  12. Ivanova, M., Risch, T.: Customizable Parallel Execution of Scientific Stream Queries. In: VLDB (2005)

    Google Scholar 

  13. Jain, N., et al.: Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core. In: SIGMOD (2006)

    Google Scholar 

  14. Johnson, S., Muthukrishnan, Shkapenyuk, V., Spatscheck, O.: Query-Aware Partitioning for Monitoring Massive Network Data Streams. In: SIGMOD (2008)

    Google Scholar 

  15. Liu, B., Zhu, Y., Rundensteiner, E.A.: Run-Time Operator State Spilling for Memory Intensive Long-Running Queries. In: SIGMOD (2006)

    Google Scholar 

  16. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 2nd edn. Prentice-Hall, Englewood Cliffs (1999)

    Google Scholar 

  17. SCSQ-LR homepage, http://user.it.uu.se/~udbl/lr.html

  18. Shah, M.A., Hellerstein, J.M., Chandrasekaran, S., Franklin, M.J.: Flux: An Adaptive Partitioning Operator for Continuous Query Systems. In: ICDE (2002)

    Google Scholar 

  19. Xing, Y., Zdonik, S., Hwang, J.-H.: Dynamic Load Distribution in the Borealis Stream Processor. In: ICDE (2005)

    Google Scholar 

  20. Yang, H., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: SIGMOD (2007)

    Google Scholar 

  21. Zeitler, E., Risch, T.: Processing high-volume stream queries on a supercomputer. In: ICDE Workshops (2006)

    Google Scholar 

  22. Zeitler, E., Risch, T.: Using stream queries to measure communication performance of a parallel computing environment. In: ICDCS Workshops (2007)

    Google Scholar 

  23. Zhou, Y., Ooi, B.C., Tan, K.-L.: Efficient Dynamic Operator Placement in a Locally Distributed Continuous Query System. In: Meersman, R., Tari, Z. (eds.) OTM 2006. LNCS, vol. 4275, pp. 54–71. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  24. Zhou, Y., Aberer, K., Tan, K.-L.: Toward massive query optimization in large-scale distributed stream systems. In: Middleware (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zeitler, E., Risch, T. (2010). Scalable Splitting of Massive Data Streams. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 5982. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12098-5_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-12098-5_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12097-8

  • Online ISBN: 978-3-642-12098-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics