Skip to main content

Building Engines and Platforms for the Big Data Age

  • Conference paper
  • First Online:
Enabling Real-Time Business Intelligence (BIRTE 2014, BIRTE 2013)

Abstract

Big data analytics involves the collection of real-time operational data into large clusters, followed by the execution of analytics queries to derive insights from the data. The results of these insights are periodically deployed into the real-time pipeline, in order to perform business actions or raise alerts. We are currently witnessing a move towards fast data analytics, where some of the offline activities may be performed in memory, directly over the real-time input streams, in order to reduce the time taken to derive and exploit insights from the data. Further, there is an increasing emphasis on enabling data scientists to derive quick approximate insights from large volumes of offline data interactively and at low cost, i.e., without having to process the entire dataset each time. Such hybrid and interconnected workflows across offline and real-time data, stored and processed across multiple machines, and with varying latency needs and complex application logic, requires a rethinking of both data and query processing models and software artifacts that realize such models. This paper surveys the challenges and requirements created by such workflows, and summarizes our research efforts on addressing these problems.

B. Chandramouli—This paper is based on joint work with Jonathan Goldstein, Mike Barnett, Rob DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, John Wernsing, Justin Levandoski, Suman Nath, Ivo Santos, Songyun Duan, Wenchao Zhou, Abdul Quamar, and several other collaborators.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 34.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 44.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Abadi, D.J., et al.: The design of the borealis stream processing system. In: CIDR (2005)

    Google Scholar 

  2. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. TKDE 17(6), 734–749 (2005)

    Google Scholar 

  3. Agarwal, S., et al.: BlinkDB: queries with bounded errors and bounded response times on very large data. In: EuroSys (2013)

    Google Scholar 

  4. Akidau, T. et al.: MillWheel: fault-tolerant stream processing at internet scale. In: VLDB (2013)

    Google Scholar 

  5. Ali, M. et al.: Microsoft CEP server and online behavioral targeting. In: VLDB (2009)

    Google Scholar 

  6. Apache hadoop. http://hadoop.apache.org/

  7. Apache storm. http://storm.incubator.apache.org/

  8. Babcock, B., et al.: Models and issues in data stream systems. In: PODS (2002)

    Google Scholar 

  9. Barga, R.S., et al.: Consistent streaming through time: a vision for event stream processing. In: CIDR, pp. 363–374 (2007)

    Google Scholar 

  10. Barnett, M., et al.: Stat! - an interactive analytics environment for big data. In: SIGMOD (2013)

    Google Scholar 

  11. Berkeley data analytics stack (BDAS). https://amplab.cs.berkeley.edu/software/

  12. Bernstein, P., et al.: Orleans: distributed virtual actors for programmability and scalability. MSR Technical report (MSR-TR-2014-41, 24). http://aka.ms/Ykyqft

  13. BlinkDB. http://blinkdb.org/

  14. Building real-time services for halo. http://research.microsoft.com/apps/video/?id=198324

  15. Cetintemel, U., et al.: S-Store: a streaming new SQL system for big velocity applications. In: VLDB (2014)

    Google Scholar 

  16. Chaiken, R., et al.: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)

    Google Scholar 

  17. Chandramouli, B., Levandoski, J.J., Eldawy, A., Mokbel, M.: StreamRec: a real-time recommender system. In: SIGMOD (2011)

    Google Scholar 

  18. Chandramouli, B., Nath, S., Zhou, W.: Supporting distributed feed-following apps over edge devices. PVLDB 6(13), 1570–1581 (2013)

    Google Scholar 

  19. Chandramouli, B., Goldstein, J., Barnett, M., DeLine, R., Fisher, D., Platt, J.C., Terwilliger, J.F., Wernsing, J.: Trill: a high-performance incremental query processor for diverse analytics. In: VLDB (2015, to appear)

    Google Scholar 

  20. Chandramouli, B., Goldstein, J., Duan, S.: Temporal analytics on big data for web advertising. In: ICDE (2012)

    Google Scholar 

  21. Chandramouli, B., Goldstein, J., Maier, D.: On-the-fly progress detection in iterative stream queries. In: VLDB (2009)

    Google Scholar 

  22. Chandramouli, B., Goldstein, J., Maier, D.: High-Performance dynamic pattern matching over disordered streams. In: VLDB (2010)

    Google Scholar 

  23. Chandramouli, B., Goldstein, J., Quamar, A.: Scalable progressive analytics on big data in the cloud. PVLDB 6(14), 1726–1737 (2013)

    Google Scholar 

  24. Chen, Y., et al.: Large-scale behavioral targeting. In: KDD (2009)

    Google Scholar 

  25. Chun, B., et al.: REEF: retainable evaluator execution framework. PVLDB 6(12), 1370–1373 (2013)

    Google Scholar 

  26. Data platforms landscape map. http://blogs.the451group.com/information_management/2014/11/18/updated-data-platforms-landscape-map/

  27. Fisher, D., Chandramouli, B., DeLine, R., Goldstein, J., Aron, A., Barnett, M., Platt, J.C., Terwilliger, J.F., Wernsing, J.: Tempe: an interactive data science environment for exploration of temporal and streaming data. MSR Technical report MSR-TR-2014–148 (2014). http://research.microsoft.com/apps/pubs/?id=232385. Accessed Nov 2014

  28. Hammad, M., et al.: NILE: a query processing engine for data streams. In: ICDE (2004)

    Google Scholar 

  29. Hellerstein, J.M., Avnur, R.: Informix under control: online query processing. J. Data Min. Knowl. Discov. 12, 281–314 (2000)

    Article  Google Scholar 

  30. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD (1997)

    Google Scholar 

  31. Internet live stats. http://www.internetlivestats.com/

  32. Jain, N., et al.: Towards a streaming SQL standard. In: VLDB (2008)

    Google Scholar 

  33. Jensen, C., Snodgrass, R.: Temporal specialization. In: ICDE (1992)

    Google Scholar 

  34. Li, B., et al.: A platform for scalable one-pass analytics using MapReduce. In: SIGMOD, pp. 985–996 (2011)

    Google Scholar 

  35. Liarou, E., et al.: Enhanced stream processing in a DBMS kernel. In: EDBT (2013)

    Google Scholar 

  36. Lim, H., et al.: How to fit when no one size fits. In: CIDR (2013)

    Google Scholar 

  37. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning in the cloud. In: VLDB (2012)

    Google Scholar 

  38. Maier, D., Li, J., Tucker, P., Tufte, K., Papadimos, V.: Semantics of data streams and operators. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 37–52. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  39. Microsoft SQL server. http://www.microsoft.com/en-us/server-cloud/products/sql-server/

  40. Murray, D., et al.: Naiad: a timely dataflow system. In: SOSP (2013)

    Google Scholar 

  41. Reactive extensions for .NET. http://aka.ms/rx

  42. Santos, I., Tilly, M., Chandramouli, B., Goldstein, J.: DiAl: distributed streaming analytics anywhere anytime. In: VLDB (2013)

    Google Scholar 

  43. The LINQ project. http://aka.ms/rjhi00

  44. Vertica. http://www.vertica.com/

  45. Wu, E., Diao, Y., Rizvi, S.: High-performance complex event processing over streams. In: SIGMOD (2006)

    Google Scholar 

  46. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI (2012)

    Google Scholar 

  47. Zaharia, M., et al.: Discretized streams: fault-tolerant streaming computation at scale. In: SOSP (2013)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Badrish Chandramouli .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chandramouli, B. (2015). Building Engines and Platforms for the Big Data Age. In: Castellanos, M., Dayal, U., Pedersen, T., Tatbul, N. (eds) Enabling Real-Time Business Intelligence. BIRTE BIRTE 2014 2013. Lecture Notes in Business Information Processing, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46839-5_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-46839-5_3

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46838-8

  • Online ISBN: 978-3-662-46839-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics