Scientometrics, Volume 114, Issue 3, pp 781–794

A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering

  • Jia Zhu
  • Xingcheng Wu
  • Xueqin Lin
  • Changqin Huang
  • Gabriel Pui Cheong Fung
  • Yong Tang

Abstract

In many types of databases, such as scientific bibliography databases, the name attribute is the most commonly used identifier for recognizing entities. However, names are frequently ambiguous and not always unique, which causes problems in various fields. Name disambiguation is a data management task that aims to distinguish different entities that share the same name. The task matters particularly for large databases such as digital libraries, where the information available to identify an author is limited. In digital libraries, ambiguous author names arise either because multiple authors share the same name or because the same author appears under different name variations. Most previous work on this issue used hierarchical clustering approaches based on information within citation records, e.g., co-authors and publication titles. In the present study, we propose a multiple layers name disambiguation framework that is applicable to digital libraries and can also be easily extended to other applications. Our framework adopts a dynamic clustering mechanism to minimize clustering errors. We evaluated our approach on real-world corpora, and the favorable experimental results indicate that the proposed framework is feasible.

Keywords

Name disambiguation; Dynamic clustering; Digital library

Notes

Acknowledgements

This work was supported by the National Science Foundation of China (No. 61772211, 61370229, 61750110516), the Natural Science Foundation of Guangdong Province, China (No. 2015A030310509), the S&T Projects of Guangdong Province, China (No. 2016A030303055, 2016B030305004, 2016B010109008), and the Science and Technology Projects of Guangzhou Municipality, China (No. 201604010003, 201604016019).

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2017

Authors and Affiliations

  • Jia Zhu (1)
  • Xingcheng Wu (1)
  • Xueqin Lin (1)
  • Changqin Huang (1)
  • Gabriel Pui Cheong Fung (2)
  • Yong Tang (1)

  1. School of Computer Science, South China Normal University, Guangzhou, China
  2. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
