Abstract
A variety of applications, such as information extraction, data mining, e-learning, and web applications, use heterogeneous and distributed data. As a result, the use of such data is hampered by deduplication issues. To address this problem, the present study proposes a novel dedupe learning method (DLM) and other algorithms to detect and correct contextual data quality anomalies. The method was created and implemented on structured data. Our methods identified and corrected more data anomalies than current taxonomy techniques. Consequently, the proposed methods could be important for detecting and correcting errors in broad contextual data (big data).
Acknowledgements
This paper was partially funded by the National Key R&D Program of China under Grant No. 2018YFB1004700 and NSFC Grant Nos. U1866602, 61602129, and 61772157.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ngueilbaye, A., Wang, H., Mahamat, D.A., Madadjim, R. (2021). Implicit Dedupe Learning Method on Contextual Data Quality Problems. In: Stahlbock, R., Weiss, G.M., Abou-Nasr, M., Yang, CY., Arabnia, H.R., Deligiannidis, L. (eds) Advances in Data Science and Information Engineering. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71704-9_22
DOI: https://doi.org/10.1007/978-3-030-71704-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71703-2
Online ISBN: 978-3-030-71704-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)