Abstract
Big data analytics involves collecting real-time operational data into large clusters and then running analytics queries to derive insights from it. The resulting insights are periodically deployed into the real-time pipeline to perform business actions or raise alerts. We are currently witnessing a move towards fast data analytics, where some of these offline activities may be performed in memory, directly over the real-time input streams, to reduce the time taken to derive and exploit insights from the data. Further, there is an increasing emphasis on enabling data scientists to derive quick approximate insights from large volumes of offline data interactively and at low cost, i.e., without having to process the entire dataset each time. Such hybrid and interconnected workflows across offline and real-time data, stored and processed across multiple machines, with varying latency needs and complex application logic, require rethinking both the data and query processing models and the software artifacts that realize them. This paper surveys the challenges and requirements created by such workflows, and summarizes our research efforts on addressing these problems.
B. Chandramouli—This paper is based on joint work with Jonathan Goldstein, Mike Barnett, Rob DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, John Wernsing, Justin Levandoski, Suman Nath, Ivo Santos, Songyun Duan, Wenchao Zhou, Abdul Quamar, and several other collaborators.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chandramouli, B. (2015). Building Engines and Platforms for the Big Data Age. In: Castellanos, M., Dayal, U., Pedersen, T., Tatbul, N. (eds) Enabling Real-Time Business Intelligence. BIRTE 2013, BIRTE 2014. Lecture Notes in Business Information Processing, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46839-5_3
DOI: https://doi.org/10.1007/978-3-662-46839-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46838-8
Online ISBN: 978-3-662-46839-5
eBook Packages: Computer Science, Computer Science (R0)