Implicit Dedupe Learning Method on Contextual Data Quality Problems

Ngueilbaye, Alladoumbaye; Wang, Hongzhi; Mahamat, Daouda Ahmat; Madadjim, Roland

doi:10.1007/978-3-030-71704-9_22

Part of the book series: Transactions on Computational Science and Computational Intelligence (TRACOSCI)

Abstract

A variety of applications, such as information extraction, data mining, e-learning, and web applications, use heterogeneous and distributed data. As a result, the use of such data is hampered by duplication problems. To address this issue, the present study proposes a novel dedupe learning method (DLM), along with other algorithms, to detect and correct contextual data quality anomalies. The method was designed and implemented on structured data. The proposed methods identified and corrected more data anomalies than current taxonomy techniques. Consequently, they should be useful for detecting and correcting errors in large-scale contextual data (big data).
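To make the notion of detecting duplicates in structured data concrete, the following is a minimal sketch of near-duplicate detection by field normalization and string similarity. It is not the DLM described in the chapter, whose details are not given in this abstract; the function names, the sample records, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    """Lowercase and collapse whitespace in every field of a record (a dict)."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}


def similarity(a, b, fields):
    """Mean character-level similarity of two records over the given fields."""
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]
    return sum(scores) / len(scores)


def find_duplicates(records, fields, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold.

    The threshold is an assumed value, not one taken from the chapter.
    """
    cleaned = [normalize(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(cleaned)), 2)
        if similarity(cleaned[i], cleaned[j], fields) >= threshold
    ]


# Hypothetical structured records, invented for illustration only.
records = [
    {"name": "Alice Martin", "city": "Harbin"},
    {"name": "alice  martin", "city": "harbin"},
    {"name": "Bob Tchad", "city": "Lincoln"},
]
pairs = find_duplicates(records, fields=["name", "city"])
print(pairs)
```

The pairwise comparison is quadratic in the number of records, so a sketch like this only suits small tables; larger data would typically need a blocking or indexing step first.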

Acknowledgements

This paper was partially funded by the National Key R&D Program of China under Grant No. 2018YFB1004700 and NSFC Grant Nos. U1866602, 61602129, and 61772157.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Alladoumbaye Ngueilbaye & Hongzhi Wang
Departement d’Informatique, Universite de N’Djamena (Tchad), N’Djamena, Chad
Daouda Ahmat Mahamat
School of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
Roland Madadjim

Corresponding author

Correspondence to Hongzhi Wang.

Editor information

Editors and Affiliations

HBS – Hamburg Business School, Institute of Information Systems, University of Hamburg, Hamburg, Hamburg, Germany
Robert Stahlbock
Department of Computer & Information Science, Fordham University, New York, NY, USA
Gary M. Weiss
College of Engineering & Computer Science, University of Michigan-Dearborn, Dearborn, MI, USA
Mahmoud Abou-Nasr
Department of Computer Science, University of Taipei, Taipei City, Taiwan
Cheng-Ying Yang
Department of Computer Science, University of Georgia, Athens, GA, USA
Hamid R. Arabnia
School of Computing and Data Sciences, Wentworth Institute of Technology, Boston, MA, USA
Leonidas Deligiannidis

Cite this paper

Ngueilbaye, A., Wang, H., Mahamat, D.A., Madadjim, R. (2021). Implicit Dedupe Learning Method on Contextual Data Quality Problems. In: Stahlbock, R., Weiss, G.M., Abou-Nasr, M., Yang, CY., Arabnia, H.R., Deligiannidis, L. (eds) Advances in Data Science and Information Engineering. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71704-9_22

DOI: https://doi.org/10.1007/978-3-030-71704-9_22
Published: 30 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71703-2
Online ISBN: 978-3-030-71704-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
