Harnessing Historical Corrections to Build Test Collections for Named Entity Disambiguation
Matching mentions of persons to the actual persons (the name disambiguation problem) is central to many digital library applications. Researchers have worked on algorithms for this matching for decades without finding a universal solution. One obstacle is that test collections for this task are often small and specific to a single collection. In this work, we present an approach that can create large test collections from historical metadata at minimal extra cost. We apply this approach to the dblp collection to generate two freely available test collections. One collection focuses on the properties of name-related defects (such as similarities of synonymous names) and the other on the evaluation of disambiguation algorithms.
Keywords: Name disambiguation, Historical metadata, dblp
The research in this paper is funded by the Leibniz Competition, grant no. LZI-SAW-2015-2. The author thanks Oliver Hoffmann for providing the data on which the dblp test collection is built and Marcel R. Ackermann for helpful discussions and suggestions.