Cluster Computing, Volume 20, Issue 3, pp 2095–2106

Making a case for the on-demand multiple distributed message queue system in a Hadoop cluster

  • Cao Ngoc Nguyen
  • Soonwook Hwang
  • Jik-Soo Kim


In this paper, we present a framework that gives users a simple, convenient, and powerful way to deploy multiple message queue systems on demand in a Hadoop cluster. Specifically, we leverage Apache Kafka, a state-of-the-art distributed message queue system that can achieve high throughput, low latency, and good load balancing. Our framework automates setting up and starting Kafka brokers on the fly, so users can adopt Kafka quickly without spending much effort on installation and configuration. In addition, the framework lets users run their Kafka-based applications without detailed knowledge of the Hadoop YARN APIs and underlying mechanisms. We present a use case of the framework that evaluates Kafka’s performance across various test cases and working scenarios. The experimental results help potential Kafka users understand how different settings influence queuing performance.
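To make "setting up Kafka brokers on the fly" concrete, the following is a minimal sketch of one part of that task: generating a per-broker configuration for an on-demand cluster. It is not the authors' framework and does not touch the YARN APIs. The function names, host names, and log directory layout are hypothetical; only the property keys (`broker.id`, `listeners`, `log.dirs`, `zookeeper.connect`) are standard Kafka broker settings. The sketch assumes each on-demand cluster is isolated by its own ZooKeeper chroot path.

```python
def broker_properties(broker_id, host, port, zk_connect, log_dir):
    """Return the text of a minimal Kafka server.properties for one broker."""
    settings = {
        "broker.id": broker_id,
        "listeners": f"PLAINTEXT://{host}:{port}",
        "log.dirs": log_dir,
        "zookeeper.connect": zk_connect,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def cluster_configs(cluster_name, hosts, zk_hosts, port=9092):
    """Build one configuration per host for a named on-demand cluster.

    Assumption: each cluster gets its own ZooKeeper chroot (zk_hosts/cluster_name),
    so several independent clusters can share one ZooKeeper ensemble.
    """
    zk_connect = f"{zk_hosts}/{cluster_name}"
    return {
        host: broker_properties(
            broker_id=index,
            host=host,
            port=port,
            zk_connect=zk_connect,
            log_dir=f"/tmp/kafka-logs/{cluster_name}-{index}",  # hypothetical path
        )
        for index, host in enumerate(hosts)
    }


if __name__ == "__main__":
    # Hypothetical hosts; a real deployment would obtain them from the resource manager.
    for host, text in cluster_configs("queue-a", ["node1", "node2"], "zk1:2181").items():
        print(f"# {host}\n{text}")
```

Calling `cluster_configs` again with a different cluster name yields a second, independent set of broker configurations on the same hosts, provided a different port is chosen.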


Distributed message queue, Kafka, Hadoop YARN, Many-task computing, MOHA



This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0190-16-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Korea Institute of Science and Technology Information, University of Science & Technology, Daejeon, Republic of Korea
  2. Department of Computer Engineering, Myongji University, Yongin, Republic of Korea
