Efficient supervised and semi-supervised approaches for affiliations disambiguation
The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR errors, abbreviations and omissions. Searching for the names of persons or organizations is therefore difficult, since a single name may appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, in particular the use of a recent clustering algorithm relying on feature maximization.
Keywords: Affiliation, Disambiguation, Data cleaning, Classification, Clustering, Semi-supervised, Bibliographic databases, K-means, Naive Bayes
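To make the first, supervised approach more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier applied to affiliation strings. It is not the authors' implementation: the tokenization, the add-one smoothing, the class name `NaiveBayesAffiliations`, and the two fictional institutions and their labels are all assumptions chosen for illustration. The semi-supervised variant (soft clustering followed by Bayesian learning) is not covered here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(affiliation):
    """Lowercase an affiliation string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", affiliation.lower())


class NaiveBayesAffiliations:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper class for illustration only.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # number of strings per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods of each token
            score = math.log(n_docs / total_docs)
            n_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (n_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Fictional training data: variant spellings of two made-up institutions.
train_texts = [
    "Dept. of Physics, Northfield University, Northfield",
    "Northfield Univ., Physics Department",
    "Phys. Dept, Univ. of Northfield",
    "Marine Biology Institute, Port Alder",
    "Inst. of Marine Biology, Port Alder",
    "Marine Biol. Inst., Port Alder Harbour",
]
train_labels = ["PHYSICS"] * 3 + ["MARINE"] * 3

model = NaiveBayesAffiliations().fit(train_texts, train_labels)
print(model.predict("Northfield Univ, Dept Physics"))    # PHYSICS
print(model.predict("Marine Biology Inst, Port Alder"))  # MARINE
```

With a toy dataset this balanced, the class priors are equal and the decision rests entirely on token likelihoods. On strongly unbalanced data the prior term can dominate, which is one plausible reading of the limitation mentioned in the abstract.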