Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution

  • Xiao Chen
  • Kirity Rapuru
  • Gabriel Campero Durand
  • Eike Schallehn
  • Gunter Saake
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)

Abstract

During the last decade, several big data processing frameworks have emerged, enabling users to analyze large-scale data with ease. With the help of these frameworks, it is easier to manage distributed programming, failures, and data partitioning issues. Entity resolution is a typical application that requires big data processing frameworks, since its time complexity increases quadratically with the size of the input data. In recent years, Apache Spark has become popular as a big data framework providing a flexible programming model that supports in-memory computation. Spark offers three APIs: RDDs, which give users core low-level data access, and the high-level DataFrame and Dataset APIs, which are part of the Spark SQL library and undergo a process of query optimization. Because of these differing features, the choice of API can be expected to influence the resulting performance of applications. However, few studies offer experimental measures to characterize the effect of such distinctions. In this paper, we evaluate the performance impact of such choices for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.
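To make the quadratic cost mentioned in the abstract concrete, the following is a minimal single-machine sketch of naive pairwise entity resolution. It is not the implementation evaluated in the paper: the function name `naive_entity_resolution`, the `threshold` value, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def naive_entity_resolution(records, threshold=0.85):
    """Compare every pair of records; return (matching index pairs, comparison count).

    Both the threshold and the similarity measure are assumptions made
    for illustration, not choices taken from the paper.
    """
    matches = []
    comparisons = 0
    # combinations(..., 2) yields each unordered pair exactly once,
    # so n records lead to n * (n - 1) / 2 comparisons.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            matches.append((i, j))
    return matches, comparisons


# Hypothetical sample data: the first two records refer to the same person.
records = ["John Smith", "Jon Smith", "Mary Jones", "Peter Miller"]
matches, comparisons = naive_entity_resolution(records)
print(matches, comparisons)
```

With n records the loop performs n(n-1)/2 comparisons, so the four sample records need 6 comparisons and doubling the input roughly quadruples the work. This growth is the reason entity resolution is usually run on a parallel framework such as Spark.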

Keywords

Entity resolution, Apache Spark, Parallel computation, High/low-level APIs

Notes

Acknowledgment

This work was supported by the China Scholarship Council [No. 201408080093].

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Xiao Chen (1)
  • Kirity Rapuru (1)
  • Gabriel Campero Durand (1)
  • Eike Schallehn (1)
  • Gunter Saake (1)

  1. Otto-von-Guericke-University of Magdeburg, Magdeburg, Germany
