Knowledge and Information Systems, Volume 31, Issue 1, pp 129–151

Scalable clustering methods for the name disambiguation problem

Regular Paper

DOI: 10.1007/s10115-011-0397-1

Cite this article as:
On, BW., Lee, I. & Lee, D. Knowl Inf Syst (2012) 31: 129. doi:10.1007/s10115-011-0397-1

Abstract

When non-unique values are used as the identifiers of entities, homonyms can cause confusion. In particular, when (part of) the “names” of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are confused because of name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is validated empirically: in our experiments, our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.
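The abstract does not describe the algorithms in detail. As a rough illustration only, the sketch below shows one coarsening pass of the kind commonly used in multilevel graph partitioning (greedy heavy-edge matching). It is not the authors' implementation. The graph, the citation labels `c1` to `c4`, the edge weights (imagined here as counts of shared co-authors), and the function names `heavy_edge_matching` and `coarsen` are all assumptions made for the example.

```python
from collections import defaultdict


def heavy_edge_matching(edges):
    """Greedily pair each unmatched node with its heaviest unmatched neighbor.

    edges: dict mapping (u, v) tuples to positive weights (undirected).
    Returns a dict mapping each node to a supernode id.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    group = {}
    next_id = 0
    for u in sorted(adj):  # fixed visiting order keeps the result deterministic
        if u in group:
            continue
        group[u] = next_id
        candidates = [(w, v) for v, w in adj[u].items() if v not in group]
        if candidates:
            _, best = max(candidates)  # heaviest edge to an unmatched neighbor
            group[best] = next_id
        next_id += 1
    return group


def coarsen(edges, group):
    """Collapse matched nodes and sum the weights of edges between supernodes."""
    coarse = defaultdict(int)
    for (u, v), w in edges.items():
        gu, gv = group[u], group[v]
        if gu != gv:  # edges inside a supernode disappear
            coarse[tuple(sorted((gu, gv)))] += w
    return dict(coarse)


# Hypothetical toy data: four citations that all carry the ambiguous name "Obama".
edges = {("c1", "c2"): 3, ("c2", "c3"): 1, ("c3", "c4"): 2}
group = heavy_edge_matching(edges)
coarse = coarsen(edges, group)
print(group)   # {'c1': 0, 'c2': 0, 'c3': 1, 'c4': 1}
print(coarse)  # {(0, 1): 1}
```

In a full multilevel scheme, coarsening would be repeated until the graph is small, the small graph would be partitioned, and the partition would then be projected back and refined. Those later phases are omitted here.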

Keywords

Name disambiguation; Clustering methods; Mixed entity resolution; Graph partitioning; Scalability

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Advanced Digital Sciences Center, Illinois at Singapore Pte Ltd, Singapore, Singapore
  2. Sorrell College of Business, Troy University, Troy, USA
  3. College of Information Sciences and Technology, Pennsylvania State University, University Park, USA
