Implicit Dedupe Learning Method on Contextual Data Quality Problems

  • Conference paper
Advances in Data Science and Information Engineering

Abstract

A variety of applications, such as information extraction, data mining, e-learning, and web applications, use heterogeneous and distributed data. As a result, the use of such data is hampered by duplication issues. To address this problem, the present study proposes a novel dedupe learning method (DLM), together with supporting algorithms, to detect and correct contextual data quality anomalies. The method was designed and implemented on structured data. The proposed methods identified and corrected more data anomalies than current taxonomy techniques. Consequently, they should be useful for detecting and correcting errors in large-scale contextual data (big data).
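The abstract does not describe how the DLM works internally, so the following is only a generic illustration of what detecting and correcting duplicates in structured data can look like, not the authors' method. It is a minimal sketch that assumes exact matching on normalized key fields; the function names (`normalize`, `find_duplicates`, `deduplicate`) and the sample records are hypothetical.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def record_key(record, key_fields):
    """Build a comparable key from the chosen fields (hypothetical helper)."""
    return tuple(normalize(str(record[field])) for field in key_fields)


def find_duplicates(records, key_fields):
    """Detection: group the indices of records that share a normalized key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record_key(record, key_fields)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def deduplicate(records, key_fields):
    """Correction: keep the first record of each group, preserving input order."""
    seen = set()
    kept = []
    for record in records:
        key = record_key(record, key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Illustrative data only; not taken from the paper.
records = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "ada  lovelace", "city": "LONDON "},
    {"name": "Alan Turing", "city": "Wilmslow"},
]

duplicate_groups = find_duplicates(records, ["name", "city"])
clean = deduplicate(records, ["name", "city"])
```

Exact matching on normalized keys only catches superficial variation such as case and spacing. A learning-based approach like the one the abstract announces would presumably handle less obvious, context-dependent duplicates, which this sketch does not attempt.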



Acknowledgements

This paper was partially funded by the National Key R&D Program of China under Grant No.2018YFB1004700 and NSFC Grant Nos. U1866602, 61602129, and 61772157.

Corresponding author

Correspondence to Hongzhi Wang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ngueilbaye, A., Wang, H., Mahamat, D.A., Madadjim, R. (2021). Implicit Dedupe Learning Method on Contextual Data Quality Problems. In: Stahlbock, R., Weiss, G.M., Abou-Nasr, M., Yang, CY., Arabnia, H.R., Deligiannidis, L. (eds) Advances in Data Science and Information Engineering. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71704-9_22
