International Journal on Digital Libraries, Volume 16, Issue 3–4, pp 229–246

On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method

  • Alan Filipe Santana
  • Marcos André Gonçalves
  • Alberto H. F. Laender
  • Anderson A. Ferreira
Article

Abstract

Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but at the cost of relying on manually labeled training sets for the learning process. Moreover, most supervised solutions simply apply some generic machine learning technique and do not exploit specific knowledge about the problem. In this article, we follow a similar line of reasoning, but in the opposite direction: instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize their parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive with these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.

Keywords

Name disambiguation · Supervised methods · Heuristics

Notes

Acknowledgments

This research is funded by INWeb (CNPq grant 57.3871/2008-6) and by the authors’ individual grants from CNPq, CAPES, and FAPEMIG.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Alan Filipe Santana (1)
  • Marcos André Gonçalves (1)
  • Alberto H. F. Laender (1)
  • Anderson A. Ferreira (2)

  1. Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  2. Departamento de Computação, Universidade Federal de Ouro Preto, Ouro Preto, Brazil
