Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP)

  • Pedro Martins
  • Maryam Abbasi
  • José Cecílio
  • Pedro Furtado
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 716)

Abstract

Work in the field of data warehousing (DW) does not address the integration of Stream Processing (SP) as a way to provide result freshness (i.e. results that include information not yet stored in the DW) while at the same time relaxing the DW processing load. Previous research focuses mainly on parallelization, for instance adding more hardware resources or parallelizing operators, queries, and storage. A well-known and widely studied approach is to use Map-Reduce to scale horizontally in order to gain storage capacity and processing performance. In many contexts, high-rate data needs to be processed in small time windows without storing results (e.g. for near real-time monitoring); in other cases, the objective is to relax data warehouse usage (e.g. keeping results updated for web-page reloads). In both cases, stream processing solutions can be set up to work together with the data warehouse (Map-Reduce or not) to keep results available on the fly, avoiding high query execution times and thereby leaving the DW servers more available to process other heavy tasks (e.g. data mining).
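To make the idea of processing high-rate data in small time windows without storing results concrete, the following is a minimal sketch of a tumbling-window count. It is an illustration only, not the mechanism evaluated in the paper; the function name `tumbling_window_counts`, the event layout `(timestamp_seconds, key)`, and the assumption that events arrive in timestamp order are all hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed
    (for this sketch) to arrive in timestamp order. Only the counters of
    the current window are kept in memory; raw events are never stored.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        # When an event falls into a new window, emit the finished one.
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts.clear()
        current_window = window
        counts[key] += 1
    # Emit the last, possibly partial, window.
    if current_window is not None:
        yield current_window * window_seconds, dict(counts)


# Hypothetical monitoring events: (timestamp in seconds, sensor id).
events = [
    (0.5, "sensor-a"),
    (1.2, "sensor-b"),
    (4.9, "sensor-a"),
    (5.1, "sensor-a"),
    (9.0, "sensor-b"),
]
results = list(tumbling_window_counts(events, window_seconds=5))
```

Each emitted pair holds the window start time and the per-key counts for that window, so a monitoring page could read the latest pair without issuing a query against the data warehouse.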

In this work, we propose the integration of Map-Reduce and Stream Processing (MRSP) for better query and DW performance. This approach relaxes the data warehouse load and, as a consequence, reduces network usage. The mechanism integrates with the Map-Reduce scalability mechanisms and uses the Map-Reduce nodes to process stream queries.
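One way to picture result freshness is a query that combines totals already stored in the DW with totals maintained by a stream query over events that have not yet been loaded. The sketch below assumes a simple additive aggregate; the class `FreshTotals` and its methods are hypothetical names chosen for illustration and do not describe the MRSP implementation.

```python
class FreshTotals:
    """Serve additive totals that include events not yet loaded into the DW.

    Hypothetical sketch: `warehouse_totals` stands in for an aggregate
    already computed by the data warehouse, and `pending` for the state
    of a stream query over events that are still waiting to be loaded.
    """

    def __init__(self, warehouse_totals):
        self.warehouse_totals = dict(warehouse_totals)
        self.pending = {}

    def on_event(self, key, amount):
        # Stream side: update the in-memory aggregate as each event arrives.
        self.pending[key] = self.pending.get(key, 0) + amount

    def query(self, key):
        # Fresh result: stored total plus the not-yet-stored contribution.
        # No warehouse query is issued here, which is the load relief.
        return self.warehouse_totals.get(key, 0) + self.pending.get(key, 0)

    def on_load_complete(self):
        # Assumption: the batch load covered exactly the pending events,
        # so their contribution moves to the stored side.
        for key, amount in self.pending.items():
            self.warehouse_totals[key] = self.warehouse_totals.get(key, 0) + amount
        self.pending.clear()


totals = FreshTotals({"north": 100})
totals.on_event("north", 5)
totals.on_event("south", 7)
fresh_north = totals.query("north")   # 105: stored 100 plus streamed 5
fresh_south = totals.query("south")   # 7: only streamed data exists
totals.on_load_complete()
after_load_north = totals.query("north")  # still 105, now fully stored
```

The sketch only works for aggregates that can be merged by addition (sums, counts); other aggregates would need a different merge step.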

Results show and compare performance gains on the DW side and the quality of experience (QoE) when executing queries and loading data.

Keywords

Complex event processing; Stream processing; Extraction, transformation and load; Distributed system; Data warehouse; Big data; Small data; Map-Reduce

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Pedro Martins (1)
  • Maryam Abbasi (1)
  • José Cecílio (1)
  • Pedro Furtado (1)

  1. Polytechnic Institute of Viseu, Department of Computer Sciences, University of Coimbra (CISUC Research Group), Coimbra, Portugal
