Author Name Disambiguation Using a New Categorical Distribution Similarity

  • Shaohua Li
  • Gao Cong
  • Chunyan Miao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7523)


Author name ambiguity is a long-standing problem that impairs the accuracy of publication retrieval and bibliometric methods. Most existing disambiguation methods are built on similarity measures, e.g., the Jaccard coefficient, between two sets of papers to be disambiguated, each set represented by a set of categorical features, e.g., coauthors and publication venues. Such measures perform poorly when the two sets are small, which is typical in author name disambiguation. In this paper, we propose a novel categorical set similarity measure. We model an author's preference, e.g., for venues, using a categorical distribution, and derive a likelihood ratio to estimate the likelihood that the two sets are drawn from the same distribution. This likelihood ratio is used as the similarity measure to decide whether two sets belong to the same author. The measure is mathematically principled and verified to perform well even when the cardinalities of the two compared sets are small. Additionally, we propose a new method to estimate the number of distinct authors for a given name, based on name statistics extracted from a digital library. Experiments show that our method significantly outperforms a baseline method, a widely used benchmark method, and a real system.
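To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the measure derived in the paper. It compares the Jaccard coefficient with a textbook generalized likelihood ratio that asks whether two samples of categorical features are better explained by one shared categorical distribution or by two separate ones. The function names, the maximum-likelihood estimates, and the venue lists are assumptions chosen for illustration only; the paper's own derivation may differ in every detail.

```python
from collections import Counter
from math import log


def jaccard(a, b):
    """Jaccard coefficient between the distinct values of two feature lists."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def mle(counts):
    """Maximum-likelihood estimate of a categorical distribution from counts."""
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def log_likelihood(counts, probs):
    """Log-probability of the observed counts under the given distribution."""
    return sum(n * log(probs[k]) for k, n in counts.items() if n > 0)


def same_source_log_ratio(a, b):
    """Log generalized likelihood ratio: one shared distribution vs. two.

    Illustrative only (hypothetical helper, not the paper's measure).
    The value is at most 0; values near 0 mean the two samples look like
    draws from the same categorical distribution.
    """
    ca, cb = Counter(a), Counter(b)
    p_pooled = mle(ca + cb)
    shared = log_likelihood(ca, p_pooled) + log_likelihood(cb, p_pooled)
    separate = log_likelihood(ca, mle(ca)) + log_likelihood(cb, mle(cb))
    return shared - separate


# Hypothetical venue lists for two small paper sets.
set_a = ["KDD", "KDD", "ICDM"]
set_b = ["KDD", "ICDM", "ICDM"]
print(jaccard(set_a, set_b), same_source_log_ratio(set_a, set_b))
```

Unlike the Jaccard coefficient, a likelihood-based score takes the frequency of each venue into account rather than only its presence. This plain maximum-likelihood version is itself crude for very small sets, so it should be read only as an illustration of the general idea.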


Keywords: Name Disambiguation, Categorical Sampling Likelihood Ratio



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Shaohua Li (1)
  • Gao Cong (1)
  • Chunyan Miao (1)
  1. Nanyang Technological University, Singapore
