Abstract
Data normalization is a laborious and costly process in the development of enterprise master data management software. We analyze the subtasks of normalization and propose an approach to automating the most laborious of them. We also describe a software system that implements the proposed approach and automatically learns expert skills.
Additional information
Original Russian Text © Ya.R. Nedumov, D.Yu. Turdakov, V.D. Maiorov, P.E. Ovchinnikov, 2013, published in Programmirovanie, 2013, Vol. 39, No. 3.
This is a joint study of 1C company and MIPT Innovation Lab within the project approved by Decree no. 218 (April 9, 2010) of the Russian Government.
About this article
Cite this article
Nedumov, Y.R., Turdakov, D.Y., Maiorov, V.D. et al. Automation of data normalization for implementing master data management systems. Program Comput Soft 39, 115–123 (2013). https://doi.org/10.1134/S0361768813030055
DOI: https://doi.org/10.1134/S0361768813030055