Abstract
With the increasing adoption of Big Data technologies as basic tools for the ongoing Digital Transformation, there is a high demand for data-intensive applications. To execute such applications efficiently, cloud providers must change the way they manage hardware infrastructure resources. However, the growing use of virtualization technologies to improve infrastructure utilization continuously widens the gap between applications and the underlying hardware, which decreases resource efficiency for the end user. This scenario is especially troublesome for Big Data applications, as storage resources are among the most heavily virtualized and therefore impose a significant overhead on large-scale data processing. This paper proposes a novel PaaS architecture specifically oriented to Big Data, in which the scheduler offers disks as resources alongside the more common CPU and memory resources, aiming to provide a better storage solution for the user. Furthermore, virtualization overheads are reduced to the bare minimum by replacing heavy hypervisor-based technologies with operating-system-level virtualization based on lightweight software containers. This architecture has been deployed on a Big Data infrastructure at the CESGA supercomputing center, which was used as a testbed to compare its performance with OpenStack, a popular private cloud platform. Results show significant performance improvements, reducing the execution time of representative Big Data workloads by up to 4.5×.
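The idea of a scheduler that offers disks alongside CPU and memory can be illustrated with a minimal sketch. It is not the scheduler described in the paper: the `Host` and `Request` types, the `place` function, and the first-fit policy are all hypothetical and chosen only to show whole disks being handed out as a schedulable resource.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical view of a physical node's free resources.
    name: str
    free_cpus: int
    free_mem_gb: int
    free_disks: list = field(default_factory=list)  # e.g. ["/dev/sdb"]


@dataclass
class Request:
    # Hypothetical container request: disks are counted as whole devices.
    cpus: int
    mem_gb: int
    disks: int


def place(request, hosts):
    """First-fit placement (an assumed policy, not the paper's).

    Returns (host name, assigned disks), or None if no host fits.
    """
    for host in hosts:
        fits = (
            host.free_cpus >= request.cpus
            and host.free_mem_gb >= request.mem_gb
            and len(host.free_disks) >= request.disks
        )
        if fits:
            # Reserve dedicated disks so no other container shares them.
            assigned = host.free_disks[:request.disks]
            host.free_disks = host.free_disks[request.disks:]
            host.free_cpus -= request.cpus
            host.free_mem_gb -= request.mem_gb
            return host.name, assigned
    return None
```

In this sketch a request that needs two disks skips a host with only one free disk even if that host has enough CPU and memory, which is the sense in which disks become a first-class scheduling dimension.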
Acknowledgements
This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain (Project TIN2016-75845-P, AEI/FEDER, EU), and by the FPU Program of the Ministry of Education (grant FPU15/03381).
About this article
Cite this article
Enes, J., Cacheiro, J.L., Expósito, R.R. et al. Big Data-Oriented PaaS Architecture with Disk-as-a-Resource Capability and Container-Based Virtualization. J Grid Computing 16, 587–605 (2018). https://doi.org/10.1007/s10723-018-9460-4
DOI: https://doi.org/10.1007/s10723-018-9460-4
Keywords
- Big data
- Platform as a Service (PaaS)
- Cloud computing
- Disk-as-a-resource scheduling
- Operating-system-level virtualization