Dynamic author name disambiguation for growing digital libraries

Abstract

When a digital library user searches for publications by author name, she often sees a mixture of publications by different authors who share that name. As digital libraries grow and more authors contribute, this author ambiguity problem is becoming critical. Author disambiguation (AD) typically addresses the problem by leveraging metadata such as coauthors, research topics, publication venues and citation information, since more personal information such as contact details is often restricted or missing. In this paper, we study how to efficiently disambiguate author names given a continuous stream of newly published papers. To this end, we propose a “BatchAD+IncAD” framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record is a paper paired with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record should be assigned to an existing cluster or to a new cluster representing an author not yet present in the previous data. Based on the new data, IncAD also tries to correct previous AD results. Our main contributions are: (1) We demonstrate with real data that a small number of new papers often share author names with a large portion of existing papers, so it is challenging for IncAD to leverage previous AD results effectively. (2) We propose a novel IncAD model that aggregates metadata from a cluster of records to estimate the author’s profile, such as her coauthor and keyword distributions, in order to predict how likely it is that a new record was “produced” by that author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while maintaining high accuracy.
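
The incremental step described above can be illustrated with a minimal sketch. It is not the model proposed in the paper: the `Cluster` class, the `profile_score` and `assign` functions, the add-one smoothing, the uniform background, the assumed vocabulary size, and the zero threshold are all hypothetical choices made for illustration. Each cluster keeps coauthor and keyword counts as a rough author profile, and a new record joins the cluster whose profile explains it better than a uniform background, or otherwise starts a new cluster.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import log


@dataclass
class Cluster:
    """Hypothetical author profile: aggregated counts over a cluster's records."""
    coauthors: Counter = field(default_factory=Counter)
    keywords: Counter = field(default_factory=Counter)

    def add(self, record):
        # Fold the record's metadata into the profile.
        self.coauthors.update(record["coauthors"])
        self.keywords.update(record["keywords"])


def profile_score(items, counts, vocab_size, alpha=1.0):
    """Mean log-ratio of the smoothed profile probability to a uniform background.

    Positive values mean the profile explains the items better than chance.
    An empty item list is treated as neutral evidence (score 0).
    """
    if not items:
        return 0.0
    total = sum(counts.values())
    return sum(
        log(vocab_size * (counts[item] + alpha) / (total + alpha * vocab_size))
        for item in items
    ) / len(items)


def assign(record, clusters, vocab_size=1000, threshold=0.0):
    """Add the record to the best-scoring cluster, or open a new one.

    vocab_size and threshold are illustrative assumptions, not tuned values.
    Returns the index of the cluster that received the record.
    """
    best_idx, best_score = None, threshold
    for idx, cluster in enumerate(clusters):
        score = (
            profile_score(record["coauthors"], cluster.coauthors, vocab_size)
            + profile_score(record["keywords"], cluster.keywords, vocab_size)
        )
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx is None:
        # No existing profile beats the background: treat as a new author.
        clusters.append(Cluster())
        best_idx = len(clusters) - 1
    clusters[best_idx].add(record)
    return best_idx
```

In this sketch, scoring a record costs time proportional to the number of candidate clusters rather than the number of stored records, which is the kind of saving an incremental approach is after. Correcting earlier results, which the abstract also attributes to IncAD, is not covered here.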

Notes

  1. http://usefulenglish.ru/vocabulary/english-names.

  2. ftp://ftp.cs.cornell.edu/pub/smart/english.stop.

  3. http://svmlight.joachims.org/.

  4. http://academic.research.microsoft.com

  5. https://github.com/yaya213/DBLP-Name-Disambiguation-Dataset.

Acknowledgments

The authors are grateful to the reviewers for their comments on this paper. The authors thank Professor Jim Kirk from the computer science department of Union University for revising the English of the paper. The authors also thank the Microsoft Academic Search team for providing the API for data access. This research is supported by the National Science Foundation of China under Grant Nos. 61173112, 60921003, 91118005, 91218301, 61221063; National High Technology Research and Development Program of China under Grant No. 2012AA011003; Key Projects in the National Science and Technology Pillar Program under Grant Nos. 2011BAK08B02, 2011BAK08B05, 2012BAH16F02; National Key Technologies R&D Program of China under Grant No. 2013BAK09B01.

Author information

Corresponding author

Correspondence to Yanan Qian.

Additional information

The paper is original; its content has not been published elsewhere.

About this article

Cite this article

Qian, Y., Zheng, Q., Sakai, T. et al. Dynamic author name disambiguation for growing digital libraries. Inf Retrieval J 18, 379–412 (2015). https://doi.org/10.1007/s10791-015-9261-3

  • DOI: https://doi.org/10.1007/s10791-015-9261-3

Keywords

  • Digital library
  • Author disambiguation
  • Data stream
  • Clustering
  • Multi-classification