The VLDB Journal, Volume 25, Issue 1, pp 3–26

epiC: an extensible and scalable system for processing Big Data

  • Dawei Jiang
  • Sai Wu
  • Gang Chen
  • Beng Chin Ooi
  • Kian-Lee Tan
  • Jun Xu
Special Issue Paper

Abstract

The Big Data problem is characterized by the so-called 3V features: volume (a huge amount of data), velocity (a high data ingestion rate), and variety (a mix of structured, semi-structured, and unstructured data). State-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework and its open-source implementation, Hadoop. Although Hadoop handles the data volume challenge successfully, it does not deal well with data variety, since its programming interface and associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system that tackles the data variety challenge of Big Data. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions: the implementation of a data processing model best suited for the data type, plus auxiliary code for mapping that data processing model into epiC’s concurrent programming model. As in Hadoop, programs written in this way are automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication. We present the design and implementation of epiC’s concurrent programming model. We also present two customized data processing models built on top of epiC: an optimized MapReduce extension and a relational model. We show how users can leverage epiC to process heterogeneous data by linking different types of operators together. To improve the performance of complex analytic jobs, epiC supports a partition-based optimization technique in which data are streamed between operators to avoid high I/O overheads. Experiments demonstrate the effectiveness and efficiency of epiC.
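
To make the Actor-like style concrete, the following is a minimal single-machine sketch and not epiC's actual API. It assumes that each unit owns private state and a mailbox, and that units interact only by passing messages. The names `Unit`, `Tokenizer`, `WordCounter`, `run_pipeline`, and the `STOP` sentinel are hypothetical, and Python threads stand in for a distributed runtime. The example links a map-style operator to a reduce-style operator and streams records between them instead of materializing intermediate results.

```python
import queue
import threading
from collections import Counter

# Hypothetical end-of-stream marker; not part of any real epiC interface.
STOP = object()


class Unit(threading.Thread):
    """Hypothetical actor-like unit: private state, a mailbox, one downstream link."""

    def __init__(self, downstream=None):
        super().__init__()
        self.mailbox = queue.Queue()
        self.downstream = downstream

    def send(self, message):
        # Other units interact with this one only by posting messages.
        self.mailbox.put(message)

    def emit(self, message):
        # Stream a message to the linked unit, if there is one.
        if self.downstream is not None:
            self.downstream.send(message)

    def run(self):
        # Process messages one at a time until the stream ends.
        while True:
            message = self.mailbox.get()
            if message is STOP:
                self.emit(STOP)  # propagate end-of-stream downstream
                return
            self.on_message(message)

    def on_message(self, message):
        raise NotImplementedError


class Tokenizer(Unit):
    """Map-style operator: splits each line and streams lowercase words."""

    def on_message(self, line):
        for word in line.split():
            self.emit(word.lower())


class WordCounter(Unit):
    """Reduce-style operator: aggregates streamed words in local state."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def on_message(self, word):
        self.counts[word] += 1


def run_pipeline(lines):
    """Link the two operators, feed the input lines, and return the word counts."""
    counter = WordCounter()
    tokenizer = Tokenizer(downstream=counter)
    counter.start()
    tokenizer.start()
    for line in lines:
        tokenizer.send(line)
    tokenizer.send(STOP)
    tokenizer.join()
    counter.join()
    return dict(counter.counts)
```

Calling `run_pipeline(["Big Data", "big data systems"])` counts words across both lines. Because each unit touches only its own state and communicates through mailboxes, the operator logic stays independent of how units are scheduled, which is the property the abstract attributes to an Actor-like model.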

Keywords

  • Parallel processing
  • MapReduce
  • Pregel
  • Hadoop

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. School of Computing, National University of Singapore, Singapore, Singapore
  2. College of Computer Science and Technology, Zhejiang University, Hangzhou, China
  3. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
