Evaluating author name disambiguation for digital libraries: a case of DBLP
Author name ambiguity in a digital library may affect the findings of research that mines the library's authorship data. This study evaluates author name disambiguation in DBLP, a digital library that is widely used but whose disambiguation performance has been insufficiently evaluated. To do so, the study takes a triangulation approach: author name disambiguation for a digital library can be better evaluated when its performance is assessed on multiple labeled datasets and compared to baselines. Tested on three types of labeled data containing 5,000 to 6 million disambiguated names, DBLP is shown to assign author names quite accurately to distinct authors, with pairwise precision, recall, and F1 measures of around 0.90 or above overall. DBLP's author name disambiguation performs well even on large ambiguous name blocks but less well at distinguishing different authors who share the same name. Compared to other disambiguation algorithms, DBLP's disambiguation performance is quite competitive, possibly due to its hybrid approach, which combines algorithmic disambiguation with manual error correction. A discussion follows on the strengths and weaknesses of the labeled datasets used in this study, to inform future efforts to evaluate author name disambiguation at digital library scale.
Keywords: Author name disambiguation; Digital library; Triangulation; Disambiguation evaluation; DBLP
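The abstract reports pairwise precision, recall, and F1 without spelling out how they are computed. The sketch below shows one common way such measures are defined in the disambiguation literature: every pair of records assigned to the same author counts as one "same-author" decision, and the predicted pairs are compared with the pairs in the labeled data. This is an illustrative assumption, not necessarily the exact procedure used in the study; the function names and the toy record identifiers are hypothetical.

```python
from itertools import combinations


def same_author_pairs(assignment):
    """Return the set of record pairs that share an author label.

    `assignment` maps a record id to an author (cluster) label.
    """
    clusters = {}
    for record, author in assignment.items():
        clusters.setdefault(author, []).append(record)
    pairs = set()
    for records in clusters.values():
        # Sort so each unordered pair has one canonical representation.
        pairs.update(combinations(sorted(records), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Compute pairwise precision, recall, and F1 (assumed definition)."""
    pred_pairs = same_author_pairs(predicted)
    true_pairs = same_author_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # With no pairs to judge, treat the score as perfect by convention.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the labeled data says r1, r2, r3 are one author and r4 another;
# the hypothetical library output merges r3 with r4 instead.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
predicted = {"r1": "x", "r2": "x", "r3": "y", "r4": "y"}
precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)
```

In this toy case one of the two predicted pairs is correct (precision 0.5) and one of the three true pairs is recovered (recall about 0.33). A single misplaced record therefore lowers both measures, since it both adds a wrong pair and breaks true ones.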
I would like to thank Florian Reitz (Leibniz Center for Informatics, Schloss Dagstuhl, Germany) for providing the list of synonyms in DBLP and Alan Filipe Santana (Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Brazil) for sharing the raw KISTI dataset. I am also thankful to anonymous reviewers for their comments. This work was supported by grants from the National Science Foundation (Grants #1561687 and #1535370), the Alfred P. Sloan Foundation, and the Ewing Marion Kauffman Foundation.