Harnessing Historical Corrections to Build Test Collections for Named Entity Disambiguation

  • Florian Reitz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11057)


Matching mentions of persons to the actual persons (the name disambiguation problem) is central to many digital library applications. Researchers have worked on algorithms for this matching for decades without finding a universal solution. One obstacle is that test collections for evaluating such algorithms are often small and specific to a single collection. In this work, we present an approach that creates large test collections from historical metadata at minimal extra cost. We apply this approach to the dblp collection to generate two freely available test collections. One collection focuses on the properties of name-related defects (such as similarities between synonymous names), the other on the evaluation of disambiguation algorithms.
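The abstract does not spell out how corrections are extracted from historical metadata, so the following is only a minimal sketch of one possible reading: compare two snapshots of the same bibliography and treat every author slot whose name string changed as a recorded correction. The function `extract_corrections`, the snapshot layout (record id mapped to an ordered author list), and the toy data are all assumptions made for illustration; they are not taken from the paper or from the dblp data format.

```python
from collections import Counter


def extract_corrections(old_snapshot, new_snapshot):
    """Return (record_id, old_name, new_name) for each author slot that changed.

    Hypothetical helper: both snapshots map a record id to its ordered
    list of author name strings.
    """
    corrections = []
    for record_id, old_authors in old_snapshot.items():
        new_authors = new_snapshot.get(record_id)
        # Skip records that disappeared or whose author count changed;
        # a position-wise comparison only makes sense for equal lengths.
        if new_authors is None or len(new_authors) != len(old_authors):
            continue
        for old_name, new_name in zip(old_authors, new_authors):
            if old_name != new_name:
                corrections.append((record_id, old_name, new_name))
    return corrections


# Invented toy snapshots (not real dblp data).
old = {
    "rec1": ["J. Smith", "Anna Lee"],
    "rec2": ["John Smith"],
    "rec3": ["Wei Wang"],
}
new = {
    "rec1": ["John Smith", "Anna Lee"],   # variant spelling unified
    "rec2": ["John Smith"],               # unchanged
    "rec3": ["Wei Wang 0002"],            # shared name split off
}

corrections = extract_corrections(old, new)

# Count how often each (old, new) name pair was corrected.
name_pairs = Counter((o, n) for _, o, n in corrections)
```

In this sketch, a pair such as ("J. Smith", "John Smith") would feed a collection about name-related defects, while the corrected assignment of records to persons could serve as a reference when evaluating a disambiguation algorithm. A real pipeline would also have to handle reordered or added authors, which this version simply skips.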


Keywords: Name disambiguation, Historical metadata, dblp



The research in this paper is funded by the Leibniz Competition, grant no. LZI-SAW-2015-2. The author thanks Oliver Hoffmann for providing the data on which the dblp test collection is built and Marcel R. Ackermann for helpful discussions and suggestions.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Schloss Dagstuhl LZI, dblp group, Wadern, Germany
