On the role of message broker middleware for many-task computing on a big-data platform

  • Cao Ngoc Nguyen
  • Jaehwan Lee
  • Soonwook Hwang
  • Jik-Soo Kim
Abstract

We have designed and implemented a new data processing framework called “Many-task computing On HAdoop” (MOHA), which aims to effectively support fine-grained many-task applications, a further class of data-intensive workload, on the YARN-based Hadoop 2.0 platform. MOHA is developed as a Hadoop YARN application so that it can transparently co-host existing many-task computing (MTC) applications with other data processing workflows, such as MapReduce, in a single Hadoop cluster. In this paper, we investigate the main characteristics of two well-known open-source message broker middleware systems (Apache ActiveMQ and Kafka) and their implications for the many-task management scheme in our MOHA framework. Through extensive experiments with a real MTC application, we demonstrate and discuss the trade-offs between parallelism and load balancing of data access patterns in message broker middleware systems for many-task computing on Hadoop.

Keywords

  • Many-task computing
  • Message broker middleware
  • Hadoop YARN
  • ActiveMQ
  • Kafka
  • MOHA
  • Load balancing

Notes

Acknowledgements

This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R0190-16-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2015R1C1A1A02036524).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Korea Institute of Science and Technology Information, University of Science & Technology, Daejeon, Republic of Korea
  2. School of Electronics and Information Engineering, Korea Aerospace University, Goyang, Republic of Korea
  3. Department of Computer Engineering, Myongji University, Yongin, Republic of Korea
