Efficient Name Disambiguation for Large-Scale Databases

  • Jian Huang
  • Seyda Ertekin
  • C. Lee Giles
Conference paper

DOI: 10.1007/11871637_53

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)
Cite this paper as:
Huang J., Ertekin S., Giles C.L. (2006) Efficient Name Disambiguation for Large-Scale Databases. In: Fürnkranz J., Scheffer T., Spiliopoulou M. (eds) Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science, vol 4213. Springer, Berlin, Heidelberg

Abstract

The name disambiguation problem arises when one seeks the publications of an author who has used different name variations, or when multiple authors share the same name. We present an efficient integrative framework for solving it: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test error and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
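
The two-stage pipeline described in the abstract (blocking, then density-based clustering within each block) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the blocking key (last name plus first initial), the Jaccard distance over coauthor sets, the `eps` and `min_pts` values, and the toy records are all assumptions made for illustration. In particular, the paper learns its distance with LASVM, which is replaced here by a simple stand-in.

```python
from collections import defaultdict


def blocking_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])


def jaccard_distance(a, b):
    """Stand-in distance in [0, 1]; the paper learns its metric with LASVM."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, distance, eps, min_pts):
    """Plain DBSCAN over an arbitrary distance; label -1 marks noise."""
    n = len(points)
    # Each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if j == i or distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # core point: keep expanding
    return labels


def disambiguate(papers, eps=0.5, min_pts=2):
    """papers: list of (author_name, coauthor_set). Returns block -> labels."""
    blocks = defaultdict(list)
    for name, coauthors in papers:
        blocks[blocking_key(name)].append(coauthors)
    return {
        key: dbscan(feats, jaccard_distance, eps, min_pts)
        for key, feats in blocks.items()
    }


# Toy records (invented for illustration only).
papers = [
    ("J. Smith", {"a. lee", "b. chen"}),
    ("John Smith", {"a. lee", "b. chen", "c. diaz"}),
    ("J. Smith", {"b. chen", "c. diaz"}),
    ("Jane Smith", {"x. wu", "y. kim"}),
    ("J. Smith", {"x. wu", "y. kim", "z. roy"}),
    ("R. Jones", {"q. park"}),
]
result = disambiguate(papers)
print(result)
```

In this toy run, the first and third "Smith" papers are farther apart than `eps`, yet they end up in the same cluster because both are density-reachable through the second paper. That is the sense in which density reachability stands in for transitivity among core points. The single "Jones" paper is labeled `-1` (unclustered) in this sketch; how such singletons are treated in the paper is not stated in the abstract.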

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jian Huang (1)
  • Seyda Ertekin (2)
  • C. Lee Giles (1, 2)
  1. College of Information Sciences and Technology, The Pennsylvania State University, University Park, U.S.A.
  2. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, U.S.A.
