
Efficient supervised and semi-supervised approaches for affiliations disambiguation

Scientometrics

Abstract

The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR errors, abbreviations, and omissions, among others. Searching for the names of persons or organizations is therefore difficult, since a single name might appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation of the approach: it cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
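
To make the supervised setting concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over affiliation strings. It is an illustration under stated assumptions, not the authors' implementation: the tokenization, the add-one smoothing, the class name `NaiveBayesAffiliations`, and the toy addresses and labels (`PHYS`, `CHEM`) are all hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(affiliation):
    """Lowercase an affiliation string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", affiliation.lower())


class NaiveBayesAffiliations:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper for illustration; the paper's actual features
    and smoothing scheme are not given in the abstract.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # labelled strings per class
        self.token_counts = defaultdict(Counter)   # token frequencies per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            n_tokens = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (n_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: variant spellings of two fictitious institutions.
train_texts = [
    "Dept. of Physics, Example University, Springfield",
    "Example Univ, Physics Department",
    "Institute of Chemistry, Sample University, Shelbyville",
    "Sample Univ., Inst. Chem.",
]
train_labels = ["PHYS", "PHYS", "CHEM", "CHEM"]

model = NaiveBayesAffiliations().fit(train_texts, train_labels)
prediction = model.predict("Physics Dept, Example Univ")
print(prediction)
```

With classes this small and balanced the prior plays no role; on highly unbalanced data the prior term and the sparse token counts of rare classes can dominate, which is one plausible reading of the limitation the abstract mentions.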

Author information

Correspondence to Pascal Cuxac.

About this article

Cite this article

Cuxac, P., Lamirel, JC. & Bonvallot, V. Efficient supervised and semi-supervised approaches for affiliations disambiguation. Scientometrics 97, 47–58 (2013). https://doi.org/10.1007/s11192-013-1025-5

  • DOI: https://doi.org/10.1007/s11192-013-1025-5
