Efficient Name Disambiguation for Large-Scale Databases
- Jian Huang (College of Information Sciences and Technology, The Pennsylvania State University)
- Seyda Ertekin (Department of Computer Science and Engineering, The Pennsylvania State University)
- C. Lee Giles (College of Information Sciences and Technology and Department of Computer Science and Engineering, The Pennsylvania State University)
The need for name disambiguation arises when one seeks a list of publications by an author who has used different name variations, and when multiple other authors share the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is computed by an online active selection support vector machine algorithm (LASVM), which yields a simpler model, lower test error and faster prediction than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
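The two-stage structure described in the abstract (blocking, then density-based clustering within each block) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the blocking key (last name plus first initial), the record fields, the `eps` and `min_pts` values, and above all the distance function, which is a simple Jaccard distance over coauthor sets standing in for the LASVM-learned metric.

```python
from collections import defaultdict


def block_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])


def distance(a, b):
    """Placeholder metric: Jaccard distance over coauthor sets.

    The paper derives this distance from an LASVM model; this stand-in
    only keeps the sketch self-contained.
    """
    union = a["coauthors"] | b["coauthors"]
    if not union:
        return 1.0
    return 1.0 - len(a["coauthors"] & b["coauthors"]) / len(union)


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point always counts as its own neighbor.
        return [j for j in range(len(points))
                if j == i or distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def disambiguate(papers, eps=0.6, min_pts=2):
    """Block papers by name key, then cluster each block independently."""
    blocks = defaultdict(list)
    for paper in papers:
        blocks[block_key(paper["author"])].append(paper)
    result = {}
    for key, block in blocks.items():
        for paper, label in zip(block, dbscan(block, eps, min_pts)):
            # Noise papers get a unique id so they are not merged together.
            result[paper["id"]] = (key, label) if label >= 0 else (key, "noise", paper["id"])
    return result


# Made-up records, for illustration only.
PAPERS = [
    {"id": 1, "author": "J. Huang", "coauthors": {"ertekin", "giles"}},
    {"id": 2, "author": "Jian Huang", "coauthors": {"giles", "ertekin"}},
    {"id": 3, "author": "J. Huang", "coauthors": {"giles"}},
    {"id": 4, "author": "J. Huang", "coauthors": {"smith", "lee"}},
    {"id": 5, "author": "Jun Huang", "coauthors": {"smith", "lee", "wong"}},
    {"id": 6, "author": "C. Lee Giles", "coauthors": {"huang"}},
]

clusters = disambiguate(PAPERS)
```

In this toy input, papers 1 to 5 fall into the same block, and DBSCAN splits them into two author clusters; paper 6 sits alone in its block and is reported as noise. Because clusters grow only through core points, two papers end up together whenever a chain of dense neighborhoods connects them, which is the sense in which density reachability stands in for transitivity.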
- Book Title
- Knowledge Discovery in Databases: PKDD 2006
- Book Subtitle
- 10th European Conference on Principles and Practice of Knowledge Discovery in Databases Berlin, Germany, September 18-22, 2006 Proceedings
- pp 536-544
- Series Title
- Lecture Notes in Computer Science
- Publisher
- Springer Berlin Heidelberg
- Copyright Holder
- Springer-Verlag Berlin Heidelberg
- Editor Affiliations
- Knowledge Engineering Group, Technische Universität Darmstadt
- Max Planck Institute for Computer Science
- Faculty of Computer Science, Otto-von-Guericke-University Magdeburg
- Author Affiliations
- College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, 16802, U.S.A.
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, U.S.A.