Towards De-duplication Framework in Big Data Analysis. A Case Study

  • Jacek Maślankowski
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 264)


Big Data analysis gives access to a wider perspective on information. In particular, it allows unstructured and structured data to be processed together. However, having many data sources does not mean that the quality of the data is sufficient to provide reliable results. There are several quality indicators related to Big Data analysis. In this paper we focus on the two that are most critical in the first phase of data processing: ambiguousness and duplicates. The goal of this paper is to propose a framework for eliminating duplicates in large datasets acquired through Big Data analysis.


Business informatics; Big Data; Unstructured data; Data analysis; Data quality



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Business Informatics, University of Gdańsk, Gdańsk, Poland
