Information Retrieval Journal, Volume 18, Issue 5, pp 379–412

Dynamic author name disambiguation for growing digital libraries

  • Yanan Qian
  • Qinghua Zheng
  • Tetsuya Sakai
  • Junting Ye
  • Jun Liu
Article

Abstract

When a digital library user searches for publications by an author name, she often sees a mixture of publications by different authors who have the same name. With the growth of digital libraries and involvement of more authors, this author ambiguity problem is becoming critical. Author disambiguation (AD) often tries to solve this problem by leveraging metadata such as coauthors, research topics, publication venues and citation information, since more personal information such as the contact details is often restricted or missing. In this paper, we study the problem of how to efficiently disambiguate author names given an incessant stream of published papers. To this end, we propose a “BatchAD+IncAD” framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record refers to a paper with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record can be assigned to an existing cluster, or to a new cluster not yet included in the previous data. Based on the new data, IncAD also tries to correct previous AD results. Our main contributions are: (1) We demonstrate with real data that a small number of new papers often have overlapping author names with a large portion of existing papers, so it is challenging for IncAD to effectively leverage previous AD results. (2) We propose a novel IncAD model which aggregates metadata from a cluster of records to estimate the author’s profile such as her coauthor distributions and keyword distributions, in order to predict how likely it is that a new record is “produced” by the author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while ensuring high accuracy.
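
The incremental step described in the abstract can be illustrated with a small sketch. This is not the paper's IncAD model; it is a simplified stand-in that assumes each cluster profile is a pair of add-alpha smoothed count distributions (coauthors and keywords) and that a record opens a new cluster when no existing profile explains it better than an empty one. All names here (ClusterProfile, avg_log_prob, assign, inc_ad) and the parameters vocab_size, alpha and margin are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterProfile:
    """Aggregated metadata for one cluster (one presumed real-world author)."""
    coauthors: Counter = field(default_factory=Counter)
    keywords: Counter = field(default_factory=Counter)
    size: int = 0

    def add(self, coauthors, keywords):
        # Fold one record's metadata into the profile.
        self.coauthors.update(coauthors)
        self.keywords.update(keywords)
        self.size += 1


def avg_log_prob(items, counts, vocab_size, alpha=1.0):
    """Mean log-probability of items under an add-alpha smoothed distribution.

    vocab_size is an assumed upper bound on distinct values (hypothetical).
    """
    if not items:
        return 0.0
    total = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[i] + alpha) / total) for i in items) / len(items)


def assign(record, profiles, vocab_size, margin=0.0):
    """Return the index of the best-matching cluster, or None for a new cluster."""
    def score(profile):
        return (avg_log_prob(record["coauthors"], profile.coauthors, vocab_size)
                + avg_log_prob(record["keywords"], profile.keywords, vocab_size))

    # An empty profile gives the uniform baseline a real cluster has to beat.
    best_idx, best = None, score(ClusterProfile()) + margin
    for i, profile in enumerate(profiles):
        s = score(profile)
        if s > best:
            best_idx, best = i, s
    return best_idx


def inc_ad(new_records, profiles, vocab_size):
    """Assign each new record to an existing or new cluster, updating profiles."""
    labels = []
    for record in new_records:
        idx = assign(record, profiles, vocab_size)
        if idx is None:
            profiles.append(ClusterProfile())
            idx = len(profiles) - 1
        profiles[idx].add(record["coauthors"], record["keywords"])
        labels.append(idx)
    return labels
```

Scoring against an empty profile avoids a hand-tuned absolute threshold: a record joins a cluster only when that cluster's history makes its coauthors and keywords more likely than a uniform guess. The sketch omits the correction of earlier results that the abstract also attributes to IncAD.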

Keywords

Digital library, Author disambiguation, Data stream, Clustering, Multi-classification

Notes

Acknowledgments

The authors are grateful to the reviewers for reviewing this paper. The authors thank Professor Jim Kirk from the computer science department of Union University for revising the English expressions of the paper. The authors also thank the Microsoft Academic Search team for providing the API for data access. This research is supported by the National Science Foundation of China under Grant Nos. 61173112, 60921003, 91118005, 91218301, 61221063; the National High Technology Research and Development Program of China under Grant No. 2012AA011003; Key Projects in the National Science and Technology Pillar Program under Grant Nos. 2011BAK08B02, 2011BAK08B05, 2012BAH16F02; and the National Key Technologies R&D Program of China under Grant No. 2013BAK09B01.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Yanan Qian (1)
  • Qinghua Zheng (1)
  • Tetsuya Sakai (2)
  • Junting Ye (1)
  • Jun Liu (1)

  1. Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
  2. Department of Computer Science and Engineering, Waseda University, Tokyo, Japan
