Completing features for author name disambiguation (AND): an empirical analysis

Waqas, Humaira; Qadir, Abdul

doi:10.1007/s11192-021-04229-x

Completing features for author name disambiguation (AND): an empirical analysis

Published: 11 January 2022

Volume 127, pages 1039–1063, (2022)
Cite this article

Scientometrics Aims and scope Submit manuscript

536 Accesses
1 Citation
Explore all metrics

Abstract

This study presents a feature enriched AND dataset to develop diverse and better performance achieving AND techniques, by utilizing AND features which have better discriminating abilities to solve this problem. Current AND datasets have limited number of useful AND features in them, some of them have been curated keeping in mind specific scenarios or contexts and some of them are domain specific. Rather than limiting the labelled datasets to be domain specific, contextual or hold limited feature values, it is better to leave their usage limit as a choice with respect to the technique which is trying to solve this problem. In this paper, our proposed labelled dataset “CustAND” provides a set of 7886 publication records, where each record covers more than eleven useful features values. The dataset covers multi domains as well as different ethnical group authors. CustAND is collected from multiple web sources, where raw data is extracted from digital libraries and search engines. This data is later cross checked, hand labelled and confirmed (authorship confirmation) by a team of graduate students with 100% accuracy. The raw data after pre-processing is validated by checking author’s personal web pages, different profile pages, their affiliations, and emails. This new dataset complements the availability of useful feature values which are crucial in developing generic and better performance achieving techniques to solve the author’s name ambiguity problem generally faced by the digital libraries.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An open-source tool for merging data from multiple citation databases

Article 08 June 2024

Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text

Article Open access 01 September 2023

Citation-based clustering of publications using CitNetExplorer and VOSviewer

Article Open access 27 February 2017

Notes

Features which can better resolve author name ambiguity as compared to others.
Digital Bibliography and Library Project is a computer science bibliography website.
Institute of Electrical and Electronics Engineers. The world’s largest technical professional organization dedicated to advancing technology for the benefit of humanity.
Association for Computing Machinery is the largest educational and scientific computing society.

References

Altman, D. G., & Bland, J. M. (1996). Statistics notes: Detecting skewness from summary information. BMJ, 313(7066), 1200.
Article Google Scholar
Aswani, N., Bontcheva, K., & Cunningham, H. (2006). Mining information for instance unification. In I. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, & L. M. Aroyo (Eds.), The semantic web: ISWC 2006 (pp. 329–342). Springer.
Chapter Google Scholar
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Article Google Scholar
Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. F. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the Association for Information Science and Technology, 61(9), 1853–1870.
Google Scholar
Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine learning with a ranking loss function. In Sixth International Workshop on Information Integration on the Web (IIWeb-07), Vancouver, Canada.
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2020). Automatic disambiguation of author names in bibliographic repositories. Synthesis Lectures on Information Concepts, Retrieval, and Services, 12(1), 1–146.
Article Google Scholar
Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2014). Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology, 65(6), 1257–1278. https://doi.org/10.1002/asi.22992
Article Google Scholar
Giles, C. L., Zha, H., & Han, H. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’05), pp. 334–343. https://doi.org/10.1145/1065385.1065462
Han, H., Xu, W., Zha, H., & Giles, C. L. (2005). A Hierarchical Naive bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 1065–1069. https://doi.org/10.1145/1066677.1066920
HEC. (n.d.). Hec Approved list of disciplines and subjects. Retrieved from https://hec.gov.pk/english/scholarshipsgrants/OSSPhase-III/PublishingImages/Pages/Additional-Documents-for-Batch-3/HEC Approved list of disciplines and subjects PhD.pdf#search=engineering disciplines.
Kang, I.-S., Kim, P., Lee, S., Jung, H., & You, B.-J. (2011). Construction of a large-scale test set for author disambiguation. Information Processing and Management, 47(3), 452–465. https://doi.org/10.1016/j.ipm.2010.10.001
Article Google Scholar
Kim, J., Kim, J., & Owen-Smith, J. (2021). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839–855.
Article Google Scholar
Kim, J., Kim, J., & Owen-Smith, J. (2019). Generating automatically labeled data for author name disambiguation: An iterative clustering method. Scientometrics, 118(1), 253–280.
Article Google Scholar
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics, 126(3), 2057–2083.
Article Google Scholar
Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047. https://doi.org/10.1002/asi.22621
Article Google Scholar
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica: Biochemia Medica, 22(3), 276–282.
Article MathSciNet Google Scholar
Momeni, F., & Mayr, P. (2016). Using co-authorship networks for author name disambiguation. IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2016, 261–262.
Article Google Scholar
Müller, M.-C., Reitz, F., & Roy, N. (2017). Data sets for author name disambiguation: An empirical analysis and a new resource. Scientometrics, 111(3), 1467–1500. https://doi.org/10.1007/s11192-017-2363-5
Article MATH Google Scholar
Qian, Y., Zheng, Q., Sakai, T., Ye, J., & Liu, J. (2015). Dynamic author name disambiguation for growing digital libraries. Information Retrieval, 18(5), 379–412. https://doi.org/10.1007/s10791-015-9261-3
Article Google Scholar
Seol, J.-W., Lee, S.-H., & Kim, K.-Y. (2016). Author disambiguation using co-author network and supervised learning approach in scholarly data. International Journal of Software Engineering and Its Applications, 10(4), 73–82.
Article Google Scholar
Song, M., Kim, E.H.-J., & Kim, H. J. (2015). Exploring author name disambiguation on PubMed-scale. Journal of Informetrics, 9(4), 924–941.
Article MathSciNet Google Scholar
Torvik, V. I., & Agarwal, S. (2016). Ethnea--an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3), 11.
Article Google Scholar
Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 39–48. https://doi.org/10.1145/1555400.1555408
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: active name disambiguation. In 2011 IEEE 11th International Conference on Data Mining, pp. 794–803.
Waqas, H., & Qadir, M. A. (2021). Multilayer heuristics based clustering framework (MHCF) for author name disambiguation. Scientometrics, 126(9), 7637–7678.
Article Google Scholar
Wikipedia, F. (2020). Google scholar. Retrieved from 20 November 2020 website: https://en.wikipedia.org/wiki/Google_Scholar.
Zhang, L., Lu, W., & Yang, J. (2021). LAGOS-AND: A large, gold standard dataset for scholarly author name disambiguation. http://arxiv.org/abs/2104.01821.
Zhu, J., Wu, X., Lin, X., Huang, C., Fung, G. P., & Tang, Y. (2018). A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering. Scientometrics, 114(3), 781–794. https://doi.org/10.1007/s11192-017-2611-8
Article Google Scholar

Download references

Author information

Authors and Affiliations

Capital University of Science and Technology, Islamabad Expressway, Kahuta Road, Zone-V, Islamabad, Pakistan
Humaira Waqas & Abdul Qadir

Authors

Humaira Waqas
View author publications
You can also search for this author in PubMed Google Scholar
Abdul Qadir
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

Corresponding author

Correspondence to Humaira Waqas.

Ethics declarations

Conflict of interest

This research has not received any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Waqas, H., Qadir, A. Completing features for author name disambiguation (AND): an empirical analysis. Scientometrics 127, 1039–1063 (2022). https://doi.org/10.1007/s11192-021-04229-x

Download citation

Received: 01 May 2021
Accepted: 25 November 2021
Published: 11 January 2022
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11192-021-04229-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Completing features for author name disambiguation (AND): an empirical analysis

Abstract

Access this article

Similar content being viewed by others

An open-source tool for merging data from multiple citation databases

Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text

Citation-based clustering of publications using CitNetExplorer and VOSviewer

Notes

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Completing features for author name disambiguation (AND): an empirical analysis

Abstract

Access this article

Similar content being viewed by others

An open-source tool for merging data from multiple citation databases

Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text

Citation-based clustering of publications using CitNetExplorer and VOSviewer

Notes

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation