Scientometrics, Volume 97, Issue 1, pp 47–58

Efficient supervised and semi-supervised approaches for affiliations disambiguation

  • Pascal Cuxac
  • Jean-Charles Lamirel
  • Valerie Bonvallot
Article

Abstract

The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR errors, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name may appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that combines soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
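
As a rough illustration of the supervised setting described in the abstract, the sketch below trains a small multinomial Naive Bayes classifier on tokenized affiliation strings. It is not the authors' implementation: the tokenizer, the class name `NaiveBayesAffiliations`, the add-one smoothing, and the toy training strings are all assumptions made for this example, and the semi-supervised variant is not covered.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(affiliation):
    """Lowercase an affiliation string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", affiliation.lower())


class NaiveBayesAffiliations:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def fit(self, texts, labels):
        # Number of training strings per canonical label (for the priors).
        self.class_counts = Counter(labels)
        # Token frequencies per canonical label (for the likelihoods).
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: raw affiliation variants mapped to a canonical label.
train_texts = [
    "INIST-CNRS, Vandoeuvre les Nancy, France",
    "Inst. Information Scientifique et Technique, CNRS, Vandoeuvre",
    "LORIA Synalp, Vandoeuvre les Nancy, France",
    "LORIA, Equipe Synalp, Nancy",
]
train_labels = ["INIST", "INIST", "LORIA", "LORIA"]

model = NaiveBayesAffiliations().fit(train_texts, train_labels)
prediction = model.predict("CNRS INIST, Vandoeuvre")
```

With so few examples per class the model is only a toy. The limitation on highly unbalanced data noted in the abstract is plausibly related to the class prior term, which favors labels with many training strings.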

Keywords

Affiliation, Disambiguation, Data cleaning, Classification, Clustering, Semi-supervised, Bibliographic databases, K-means, Naive Bayes

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2013

Authors and Affiliations

  • Pascal Cuxac (1)
  • Jean-Charles Lamirel (2)
  • Valerie Bonvallot (1)

  1. INIST-CNRS, Vandoeuvre les Nancy, France
  2. LORIA-Synalp, Vandoeuvre les Nancy, France
