Knowledge and Information Systems, Volume 31, Issue 1, pp 129–151

Scalable clustering methods for the name disambiguation problem

Regular Paper

Abstract

When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (parts of) the “names” of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are mixed up because of name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish “Masao Obama” from “Norio Obama”). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated through experiments: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.

Keywords

Name disambiguation; Clustering methods; Mixed entity resolution; Graph partitioning; Scalability

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Advanced Digital Sciences Center, Illinois at Singapore Pte Ltd, Singapore, Singapore
  2. Sorrell College of Business, Troy University, Troy, USA
  3. College of Information Sciences and Technology, Pennsylvania State University, University Park, USA
