Vietnamese Author Name Disambiguation for Integrating Publications from Heterogeneous Sources

  • Tin Huynh
  • Kiem Hoang
  • Tien Do
  • Duc Huynh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7802)

Abstract

Automatic integration of bibliographical data from various sources is a critical task in the field of digital libraries. One of the most important challenges in this process is author name disambiguation. In this paper, we apply a supervised learning approach and propose a set of features that can be used to train classifiers to disambiguate Vietnamese author names. To evaluate the effectiveness of the proposed feature set, we ran experiments with five supervised learning methods: Random Forest, Support Vector Machine (SVM), k-Nearest Neighbors (kNN), C4.5 (decision tree), and Bayes. The experimental dataset was collected from three online digital libraries: Microsoft Academic Search, ACM Digital Library, and IEEE Digital Library. Our experiments showed that the kNN, Random Forest, and C4.5 classifiers outperform the others. The average accuracy achieved is approximately 94.55% with kNN, 94.23% with Random Forest, 93.98% with C4.5, 91.91% with SVM, and 81.56% with Bayes, the lowest. In summary, the highest accuracy we achieved for the author name disambiguation problem with the proposed feature set was 98.39% in our experiments on the Vietnamese authors dataset.
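
To make the supervised setup concrete, the following is a minimal sketch of one way such a classifier could be framed, using kNN because it reports the best average accuracy above. The abstract does not list the proposed features, so the three pairwise features here (co-author overlap, title-word overlap, affiliation match), the pairwise "same"/"different" formulation, the record fields, and the toy training vectors are all assumptions made for illustration, not the authors' feature set or data.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_features(rec1, rec2):
    """Hypothetical similarity features for two records that share an author name."""
    return (
        jaccard(rec1["coauthors"], rec2["coauthors"]),
        jaccard(rec1["title"].lower().split(), rec2["title"].lower().split()),
        1.0 if rec1["affiliation"] == rec2["affiliation"] else 0.0,
    )


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to `query` (Euclidean)."""
    def distance(item):
        vector, _label = item
        return sqrt(sum((x - y) ** 2 for x, y in zip(vector, query)))

    nearest = sorted(train, key=distance)[:k]
    votes = Counter(label for _vector, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labelled pairs (invented for illustration): feature vector -> label.
train = [
    ((0.8, 0.3, 1.0), "same"),
    ((0.5, 0.2, 1.0), "same"),
    ((0.6, 0.1, 0.0), "same"),
    ((0.0, 0.1, 0.0), "different"),
    ((0.0, 0.0, 1.0), "different"),
    ((0.1, 0.2, 0.0), "different"),
]

# Two invented records carrying the same ambiguous author name.
rec_a = {
    "coauthors": ["Coauthor A", "Coauthor B"],
    "title": "Author name disambiguation for digital libraries",
    "affiliation": "Institute X",
}
rec_b = {
    "coauthors": ["Coauthor A"],
    "title": "Integrating bibliographical data from digital libraries",
    "affiliation": "Institute X",
}

prediction = knn_predict(train, pair_features(rec_a, rec_b), k=3)
print(prediction)
```

With these toy inputs the two records share a co-author, two title words, and an affiliation, so their nearest labelled neighbours are mostly "same" pairs. A real evaluation would replace the toy vectors with labelled pairs drawn from the integrated libraries and compare several classifiers on held-out data.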

Keywords

Digital Library, Data Integration, Bibliographical Data, Author Disambiguation, Machine Learning

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Tin Huynh (1)
  • Kiem Hoang (1)
  • Tien Do (1)
  • Duc Huynh (1)

  1. University of Information Technology, HCMC, Vietnam
