Efficient Name Disambiguation for Large-Scale Databases

  • Jian Huang
  • Seyda Ertekin
  • C. Lee Giles
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)

Abstract

The name disambiguation problem arises when one seeks the publication list of an author who has used different name variations, and when multiple authors share the same name. We present an efficient integrative framework for solving it: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is computed by an online active-selection support vector machine algorithm (LASVM), which yields a simpler model, lower test error, and faster prediction than a standard SVM. We prove that, by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved a pairwise F1 of 90.6%. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
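
The clustering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the distance matrix below is a hand-written toy standing in for the LASVM-derived distances, the blocking step is assumed to have already produced one candidate block of five papers, and the names `dbscan`, `pairwise_f1`, `eps`, and `min_pts` are chosen here for illustration. Treating noise points as singletons in the F1 computation is likewise an assumption.

```python
from itertools import combinations

NOISE = -1


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN over a precomputed symmetric distance matrix.

    Returns one label per point; NOISE (-1) marks points in no cluster.
    """
    n = len(dist)
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        # eps-neighborhood of point i, including i itself
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is also core: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def pairwise_f1(predicted, truth):
    """Pairwise F1: harmonic mean of precision and recall over paper pairs."""
    pairs = list(combinations(range(len(truth)), 2))
    # Assumption: noise points are singletons, so they form no predicted pair.
    pred_same = {(a, b) for a, b in pairs
                 if predicted[a] == predicted[b] and predicted[a] != NOISE}
    true_same = {(a, b) for a, b in pairs if truth[a] == truth[b]}
    tp = len(pred_same & true_same)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_same)
    recall = tp / len(true_same)
    return 2 * precision * recall / (precision + recall)


# Toy block: papers 0-2 by one author, papers 3-4 by another (made-up data).
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.90, 0.95],
    [0.90, 0.85, 0.90, 0.00, 0.10],
    [0.80, 0.90, 0.95, 0.10, 0.00],
]
truth = ["A", "A", "A", "B", "B"]

labels = dbscan(dist, eps=0.3, min_pts=2)
print(labels)                      # [0, 0, 0, 1, 1]
print(pairwise_f1(labels, truth))  # 1.0
```

Two papers end up in the same cluster when a chain of core points links them, each within `eps` of the next; this is the density reachability that the abstract uses in place of explicit transitivity.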

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jian Huang (1)
  • Seyda Ertekin (2)
  • C. Lee Giles (1, 2)

  1. College of Information Sciences and Technology, The Pennsylvania State University, University Park, U.S.A.
  2. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, U.S.A.
