Building Engines and Platforms for the Big Data Age

Chandramouli, Badrish

doi:10.1007/978-3-662-46839-5_3

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 206)

Abstract

Big data analytics involves the collection of real-time operational data into large clusters, followed by the execution of analytics queries to derive insights from the data. These insights are periodically deployed into the real-time pipeline in order to perform business actions or raise alerts. We are currently witnessing a move towards fast data analytics, where some of the offline activities may be performed in memory, directly over the real-time input streams, in order to reduce the time taken to derive and exploit insights from the data. Further, there is an increasing emphasis on enabling data scientists to derive quick approximate insights from large volumes of offline data interactively and at low cost, i.e., without having to process the entire dataset each time. Such hybrid and interconnected workflows across offline and real-time data, stored and processed across multiple machines, with varying latency needs and complex application logic, require a rethinking of both the data and query processing models and the software artifacts that realize such models. This paper surveys the challenges and requirements created by such workflows, and summarizes our research efforts to address these problems.

B. Chandramouli—This paper is based on joint work with Jonathan Goldstein, Mike Barnett, Rob DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, John Wernsing, Justin Levandoski, Suman Nath, Ivo Santos, Songyun Duan, Wenchao Zhou, Abdul Quamar, and several other collaborators.

Author information

Authors and Affiliations

Microsoft Research, One Microsoft Way, Redmond, WA, 98052, USA
Badrish Chandramouli

Corresponding author

Correspondence to Badrish Chandramouli.

Editor information

Editors and Affiliations

Hewlett-Packard, Palo Alto, USA
Malu Castellanos
Hitachi Laboratories, Santa Clara, California, USA
Umeshwar Dayal
Aalborg University, Aalborg, Denmark
Torben Bach Pedersen
Intel Labs and MIT, Cambridge, Massachusetts, USA
Nesime Tatbul

Chandramouli, B. (2015). Building Engines and Platforms for the Big Data Age. In: Castellanos, M., Dayal, U., Pedersen, T., Tatbul, N. (eds) Enabling Real-Time Business Intelligence. BIRTE 2013, BIRTE 2014. Lecture Notes in Business Information Processing, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46839-5_3

DOI: https://doi.org/10.1007/978-3-662-46839-5_3
Published: 30 April 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46838-8
Online ISBN: 978-3-662-46839-5
eBook Packages: Computer Science, Computer Science (R0)
