Data Preparation as a Service Based on Apache Spark

  • Nivethika Mahasivam
  • Nikolay Nikolov
  • Dina Sukhobok
  • Dumitru Roman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10465)


Data preparation is the process of collecting, cleaning and consolidating raw datasets into cleaned data of a certain quality. It is an important part of almost every data analysis process, yet it remains tedious and time-consuming. The recent tendency to derive knowledge from very large datasets further increases the complexity of the process. Existing data preparation tools provide limited capabilities for processing such large volumes of data effectively. Frameworks and software libraries that do address the requirements of big data, on the other hand, require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges of data preparation at large scale while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and its integration with Grafterizer, an existing framework for data preparation. Our solution is based on Apache Spark and exposes application programming interfaces (APIs) for integration with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.


Distributed data parallel processing, Apache Spark, Big data preparation, Interactive data preparation



The work in this paper is partly supported by the EC-funded projects proDataMarket (Grant number: 644497), euBusinessGraph (Grant number: 732003), and EW-Shopp (Grant number: 732590). The authors would like to thank Bjørn Marius von Zernichow for his help in improving the readability of the camera-ready version of the paper.



Copyright information

© IFIP International Federation for Information Processing 2017

Authors and Affiliations

  • Nivethika Mahasivam (1)
  • Nikolay Nikolov (1)
  • Dina Sukhobok (1)
  • Dumitru Roman (1)

  1. SINTEF, Oslo, Norway
