A Term-Based Driven Clustering Approach for Name Disambiguation

  • Jia Zhu
  • Xiaofang Zhou
  • Gabriel Pui Cheong Fung
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5446)


Name disambiguation in databases is a non-trivial task because people's names are often not unique and usually only limited information is associated with each name in the database. For example, in DBLP many authors share the same name, and there is no unique identifier to distinguish them. To make matters worse, we may not always be able to access the full contents of the materials unless we have joined the organizations (e.g. ACM) that publish them. Disambiguating names with such limited information is therefore very challenging. In this paper, we focus on this situation and propose a term-based driven clustering approach for solving it. Specifically, we first construct term-based taxonomies that mimic the expert knowledge of the domain by automatically linking the related terms that appear in it. Each taxonomy is then transformed into a graph, and we group the entries that belong to the same author by using either of two novel models, namely, a graph-based similarity model and a graph-based random walk model. The former model computes the similarity among terms, whereas the latter model investigates how likely one set of terms is to be transformed into another set of terms. Extensive experiments are conducted using the entries in DBLP. The favorable results indicate that the proposed approach is highly effective.
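To make the random walk idea concrete, the following is a minimal Python sketch and not the model from the paper. It assumes, purely for illustration, that the term graph is built from plain term co-occurrence rather than from the paper's taxonomies, and that "how likely one set of terms is transformed into another" is approximated by the probability mass a short random walk started on the first set places on the second. The names `build_term_graph`, `walk_distribution`, and `reach_score`, the `steps` parameter, and the sample terms are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_term_graph(entries):
    """Link terms that co-occur in one entry; edge weight = co-occurrence count.

    Assumption: co-occurrence stands in for the taxonomy links of the paper.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for terms in entries:
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def walk_distribution(graph, start_terms, steps):
    """Distribution over terms after `steps` random-walk steps,
    starting uniformly on `start_terms`."""
    start = set(start_terms)
    if not start:
        return {}
    dist = {t: 1.0 / len(start) for t in start}
    for _ in range(steps):
        nxt = defaultdict(float)
        for term, p in dist.items():
            nbrs = graph.get(term)
            if not nbrs:
                nxt[term] += p  # term with no edges: the walker stays put
                continue
            total = sum(nbrs.values())
            for nb, w in nbrs.items():
                nxt[nb] += p * w / total  # move proportionally to edge weight
        dist = dict(nxt)
    return dist


def reach_score(graph, source_terms, target_terms, steps=2):
    """Probability mass that a walk from source_terms puts on target_terms."""
    dist = walk_distribution(graph, source_terms, steps)
    return sum(dist.get(t, 0.0) for t in set(target_terms))


# Hypothetical term sets extracted from three bibliographic entries.
entries = [{"xml", "query"}, {"query", "index"}, {"protein", "folding"}]
graph = build_term_graph(entries)
print(reach_score(graph, {"xml"}, {"index"}))    # connected via "query"
print(reach_score(graph, {"xml"}, {"protein"}))  # different component
```

Under these assumptions, two entries whose term sets reach each other with high probability would be candidates for the same author cluster, while entries in disconnected parts of the graph score zero. A real system would still need a threshold or clustering step on top of these scores.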


Keywords: Taxonomy, Clustering, Name Disambiguation, Graph





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jia Zhu (1)
  • Xiaofang Zhou (1)
  • Gabriel Pui Cheong Fung (1)
  1. School of ITEE, The University of Queensland, Australia
