Scalable Splitting of Massive Data Streams

Zeitler, Erik; Risch, Tore

doi:10.1007/978-3-642-12098-5_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5982)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application-specific stream splitting, we introduce splitstream functions where the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. A cost model is introduced that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions, and relate experimental results to cost model estimates. Based on the results, a splitstream function called autosplit is proposed, which scales well for high degrees of parallelism, and is robust for varying proportions of stream partitioning and replication. We show how user-defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.
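
The distinction between partitioning and replication can be illustrated with a minimal sketch. This is not the splitstream or autosplit implementation from the paper, whose interface is not shown here. The names `split_stream` and `route_key` and the tuple layout are hypothetical, and the sketch is single-threaded and works on in-memory lists rather than real streams. It only shows the routing idea: a tuple is either sent to exactly one sub-stream (partitioning) or copied to every sub-stream (replication).

```python
from typing import Any, Callable, Hashable, Iterable, List, Optional


def split_stream(
    stream: Iterable[Any],
    n_outputs: int,
    route_key: Callable[[Any], Optional[Hashable]],
) -> List[List[Any]]:
    """Split a stream into n_outputs sub-streams (hypothetical helper).

    route_key(t) decides the routing of tuple t:
      - None       -> replicate t to every sub-stream
      - any other  -> partition: hash the key to pick one sub-stream
    """
    outputs: List[List[Any]] = [[] for _ in range(n_outputs)]
    for t in stream:
        key = route_key(t)
        if key is None:
            # Replication: every sub-stream receives a copy of the tuple.
            for out in outputs:
                out.append(t)
        else:
            # Partitioning: exactly one sub-stream receives the tuple.
            outputs[hash(key) % n_outputs].append(t)
    return outputs


# Assumed tuple layout: (kind, partition_id, payload).
events = [("pos", 0, "a"), ("pos", 1, "b"), ("query", None, "q"), ("pos", 2, "c")]

# Position reports are partitioned by their integer id; queries are replicated.
subs = split_stream(events, 2, lambda t: t[1] if t[0] == "pos" else None)
```

In this sketch the routing rule is supplied by the caller, which loosely mirrors the idea of user-specified splitting. The single loop also makes the bottleneck visible: every input tuple passes through one splitter, and replicated tuples cost one append per sub-stream.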

Author information

Authors and Affiliations

Department of Information Technology, Uppsala University, Sweden
Erik Zeitler & Tore Risch

Editor information

Editors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, 305–8573, Tennodai, Tsukuba, Ibaraki, Japan
Hiroyuki Kitagawa
Information Technology Center, Nagoya University, 464-8601, Furo-cho, Chikusa-ku, Nagoya, Japan
Yoshiharu Ishikawa
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Department of Information Science, Ochanomizu University, 2-1-1, Otsuka, Bunkyo-ku, 112-8610, Tokyo, Japan
Chiemi Watanabe

About this paper

Cite this paper

Zeitler, E., Risch, T. (2010). Scalable Splitting of Massive Data Streams. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 5982. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12098-5_15

DOI: https://doi.org/10.1007/978-3-642-12098-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12097-8
Online ISBN: 978-3-642-12098-5
eBook Packages: Computer Science, Computer Science (R0)
