Abstract
The traditional architecture for data storage and analysis is no longer adequate for today's rapid flow of information. Big data technology brings significant benefits in efficiency and productivity, but a successful migration to it requires an efficient architecture. In this paper, we propose an architecture for importing our campus's existing power data storage system into a big data platform built around a data lake. We use Apache Sqoop to transfer historical data to Apache Hive for storage. Kafka is used to ensure the integrity of streaming data and serves as the input source for Spark Streaming, which writes the data to HBase. To integrate the data, we use the concept of a data lake based on Hive and HBase. Impala and Apache Phoenix are used as the query engines for Hive and HBase, respectively. Apache Spark can quickly analyze and compute on the data in the data lake, and we choose Apache Superset as the visualization solution.
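The two ingestion paths named in the abstract can be summarized in a small illustrative model. This is only a sketch of the data flow, not the paper's implementation: the component names come from the abstract, while the dictionary layout, the `PIPELINE` name, and the `describe_path` helper are hypothetical and introduced here for illustration.

```python
from typing import Dict, List

# Illustrative model of the data flow described in the abstract.
# Component names are taken from the abstract; the structure and the
# helper below are hypothetical and exist only to make the flow explicit.
PIPELINE: Dict[str, dict] = {
    "historical": {
        "ingest": ["Apache Sqoop"],              # bulk transfer of existing data
        "store": "Apache Hive",
        "query": "Impala",
    },
    "streaming": {
        "ingest": ["Kafka", "Spark Streaming"],  # Kafka feeds Spark Streaming
        "store": "HBase",
        "query": "Apache Phoenix",
    },
}

# Shared layers that sit on top of the data lake (Hive + HBase).
ANALYSIS = "Apache Spark"
VISUALIZATION = "Apache Superset"


def describe_path(kind: str) -> List[str]:
    """Return the ordered components for one kind of data, from ingest to query."""
    path = PIPELINE[kind]
    return [*path["ingest"], path["store"], path["query"]]


if __name__ == "__main__":
    for kind in PIPELINE:
        print(kind + ": " + " -> ".join(describe_path(kind)))
    print("analysis: " + ANALYSIS + ", visualization: " + VISUALIZATION)
```

Keeping the analysis and visualization layers outside the per-path entries reflects the abstract's wording that Spark and Superset operate on the data lake as a whole rather than on one store.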
Acknowledgement
This work was supported in part by the Ministry of Science and Technology, Taiwan R.O.C., under grant numbers MOST 104-2221-E-029-010-MY3 and MOST 106-2622-E-029-002-CC3.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, TY., Yang, CT., Kristiani, E., Cheng, CT. (2019). On Construction of a Power Data Lake Platform Using Spark. In: Hung, J., Yen, N., Hui, L. (eds) Frontier Computing. FC 2018. Lecture Notes in Electrical Engineering, vol 542. Springer, Singapore. https://doi.org/10.1007/978-981-13-3648-5_11
DOI: https://doi.org/10.1007/978-981-13-3648-5_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-3647-8
Online ISBN: 978-981-13-3648-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)