Generating automatically labeled data for author name disambiguation: an iterative clustering method

Kim, Jinseok; Kim, Jinmo; Owen-Smith, Jason

doi:10.1007/s11192-018-2968-3

Published: 29 November 2018

Volume 118, pages 253–280, (2019)

Scientometrics

Abstract

To train algorithms for supervised author name disambiguation, many studies have relied on hand-labeled truth data that are very laborious to generate. This paper shows that labeled data can be automatically generated using information features such as email address, coauthor names, and cited references that are available from publication records. For this purpose, high-precision rules for matching name instances on each feature are decided using an external-authority database. Then, selected name instances in target ambiguous data go through the process of pairwise matching based on the rules. Next, they are merged into clusters by a generic entity resolution algorithm. The clustering procedure is repeated over other features until further merging is impossible. Tested on 26 K instances out of the population of 228 K author name instances, this iterative clustering produced accurately labeled data with pairwise F1 = 0.99. The labeled data represented the population data in terms of name ethnicity and co-disambiguating name group size distributions. In addition, trained on the labeled data, machine learning algorithms disambiguated 24 K names in test data with performance of pairwise F1 = 0.90–0.92. Several challenges are discussed for applying this method to resolving author name ambiguity in large-scale scholarly data.
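The iterative procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy records, and the two matching rules (identical email; at least two shared coauthors) are assumptions made for the example, whereas the paper derives its high-precision rules from an external-authority database. The last function shows how pairwise F1 is conventionally computed from predicted and true clusters.

```python
from itertools import combinations

def merge_clusters(clusters, match):
    """Merge clusters until no two clusters contain a matching instance pair."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if any(match(a, b) for a in clusters[i] for b in clusters[j]):
                clusters[i] |= clusters[j]
                del clusters[j]   # removal shifts the indices of later clusters
                merged = True
                break             # restart the scan on the shortened list
    return clusters

def iterative_clustering(instances, rules):
    """Apply each rule in turn and repeat until no rule merges anything."""
    clusters = [{key} for key in instances]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            before = len(clusters)
            clusters = merge_clusters(
                clusters, lambda a, b: rule(instances[a], instances[b]))
            changed = changed or len(clusters) < before
    return clusters

def cluster_pairs(clusters):
    """All unordered instance pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_f1(predicted, truth):
    """Harmonic mean of pairwise precision and pairwise recall."""
    pred, true = cluster_pairs(predicted), cluster_pairs(truth)
    if not pred or not true:
        return 0.0
    tp = len(pred & true)
    precision, recall = tp / len(pred), tp / len(true)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical matching rules; the thresholds are assumptions for illustration.
def same_email(x, y):
    return x["email"] is not None and x["email"] == y["email"]

def shared_coauthors(x, y):
    return len(x["coauthors"] & y["coauthors"]) >= 2

# Toy name instances sharing one ambiguous name (invented data).
instances = {
    1: {"email": "a@x.edu", "coauthors": {"lee", "park"}},
    2: {"email": "a@x.edu", "coauthors": {"cho", "yoon"}},
    3: {"email": None, "coauthors": {"cho", "yoon", "park"}},
    4: {"email": "b@y.org", "coauthors": {"smith"}},
}

clusters = iterative_clustering(instances, [same_email, shared_coauthors])
```

In this toy run, instances 1 and 2 merge on email; instance 3 then joins them through the coauthors it shares with instance 2, which illustrates why clustering is repeated over other features after the first pass.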

Notes

This paper distinguishes meanings of author, name, and name instance. An author refers to a distinct entity, a name to a textual string representing the author, and a name instance to an individual occurrence of the name in data. For example, an author (the distinguished professor Mark E. J. Newman at the University of Michigan Department of Physics) can be represented by one or more names (Mark Newman, M. E. J. Newman, etc.) that appear hundreds of times (i.e., instances) through his publication records in bibliometric data.
Other than these three types, a few studies have used synthetic labeled data (e.g., Milojević 2013). Another notable labeling approach is to use the intersection set of disambiguation results by multiple algorithms (Vogel et al. 2014).
This does not imply that only Han et al. (2004)’s data contain flaws. No other labeled data than Han et al. (2004)’s have received such intensive scrutiny for errors.
https://www.crossref.org/.
https://europepmc.org/.
Note that the cluster [#2|#5] is indexed as j = 2, not j = 3 because the prior merging removes cluster₂ [#1|#4] from clusterList.
https://clarivate.com/products/web-of-science/web-science-form/web-science-core-collection/.
https://clarivate.com/products/journal-citation-reports/.
https://static.aminer.org/lab-datasets/citation/dblp.v10.zip.
https://figshare.com/articles/ORCID_Public_Data_File_2017/5479792/1.
https://doi.org/10.13012/B2IDB-9087546_V1.
https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt.
https://tartarus.org/martin/PorterStemmer/.
Classifiers were implemented with parameter settings as follows: L2 Regularization with class weight = 1 (LR), Gaussian Naïve Bayes with maximum likelihood estimator (NB), and 500 trees (after grid search) with Gini Impurity for split quality measure (RF). For more details, see http://scikit-learn.org/stable/index.html.
The hierarchical agglomerative clustering algorithm and the overall training-test procedure were implemented by modifying code from Louppe et al. (2016).
http://dblp.org/faq/17334571.

References

Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. https://doi.org/10.1002/asi.23329.
Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. F. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870. https://doi.org/10.1002/asi.21363.
D’Angelo, C. A., Giuffrida, C., & Abramo, G. (2011). A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. Journal of the American Society for Information Science and Technology, 62(2), 257–269. https://doi.org/10.1002/asi.21460.
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. Sigmod Record, 41(2), 15–26.
Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2014). Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology, 65(6), 1257–1278. https://doi.org/10.1002/asi.22992.
Gomide, J., Kling, H., & Figueiredo, D. (2017). Name usage pattern in the synonym ambiguity problem in bibliographic data. Scientometrics, 112(2), 747–766. https://doi.org/10.1007/s11192-017-2410-2.
Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: A system to uniquely identify researchers. Learned Publishing, 25(4), 259–264. https://doi.org/10.1087/20120404.
Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the fourth ACM/IEEE joint conference on digital libraries (pp. 296–305). https://doi.org/10.1145/996350.996419.
Han, H., Xu, W., Zha, H., & Giles, C. L. (2005a). A hierarchical naive Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM symposium on Applied computing (SAC '05). Santa Fe, New Mexico.
Han, H., Zha, H. Y., & Giles, C. L. (2005b). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE joint conference on digital libraries (pp. 334–343). https://doi.org/10.1145/1065385.1065462.
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a large-scale test set for author disambiguation. Information Processing and Management, 47(3), 452–465. https://doi.org/10.1016/j.ipm.2010.10.001.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886. https://doi.org/10.1007/s11192-018-2824-5.
Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446–1461. https://doi.org/10.1002/asi.23489.
Kim, J., & Kim, J. (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics, 117(1), 511–526. https://doi.org/10.1007/s11192-018-2865-9.
Lerchenmueller, M. J., & Sorenson, O. (2016). Author disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE, 11(7), e0158731. https://doi.org/10.1371/journal.pone.0158731.
Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047. https://doi.org/10.1002/asi.22621.
Liu, W., Islamaj Dogan, R., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., et al. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology, 65(4), 765–781. https://doi.org/10.1002/asi.23063.
Louppe, G., Al-Natsheh, H. T., Susik, M., & Maguire, E. J. (2016). Ethnicity sensitive author disambiguation using semi-supervised learning. Knowledge Engineering and Semantic Web, Kesw, 2016(649), 272–287. https://doi.org/10.1007/978-3-319-45880-9_21.
Milojević, S. (2013). Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4), 767–773.
Müller, M. C., Reitz, F., & Roy, N. (2017). Data sets for author name disambiguation: an empirical analysis and a new resource. Scientometrics, 111(3), 1467–1500. https://doi.org/10.1007/s11192-017-2363-5.
Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404–409. https://doi.org/10.1073/pnas.021544898.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Qian, Y., Zheng, Q., Sakai, T., Ye, J., & Liu, J. (2015). Dynamic author name disambiguation for growing digital libraries. Information Retrieval Journal, 18(5), 379–412. https://doi.org/10.1007/s10791-015-9261-3.
Santana, A. F., Gonçalves, M. A., Laender, A. H. F., & Ferreira, A. A. (2015). On the combination of domain-specific heuristics for author name disambiguation: The nearest cluster method. International Journal on Digital Libraries, 16(3–4), 229–246. https://doi.org/10.1007/s00799-015-0158-y.
Schulz, C., Mazloumian, A., Petersen, A. M., Penner, O., & Helbing, D. (2014). Exploiting citation networks for large-scale author name disambiguation. EPJ Data Science. https://doi.org/10.1140/epjds/s13688-014-0011-3.
Schulz, J. (2016). Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics, 107(3), 1283–1298. https://doi.org/10.1007/s11192-016-1892-7.
Shin, D., Kim, T., Choi, J., & Kim, J. (2014). Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics, 100(1), 15–50. https://doi.org/10.1007/s11192-014-1289-4.
Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43, 287–313.
Song, M., Kim, E. H. J., & Kim, H. J. (2015). Exploring author name disambiguation on PubMed-scale. Journal of Informetrics, 9(4), 924–941. https://doi.org/10.1016/j.joi.2015.08.004.
Strotmann, A., & Zhao, D. Z. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9), 1820–1833. https://doi.org/10.1002/asi.22695.
Torvik, V. I. (2015). MapAffil: A Bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib magazine: The magazine of the Digital Library Forum. https://doi.org/10.1045/november2015-torvik.
Torvik, V. I., & Agarwal, S. (2016). Ethnea: An instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. In Paper presented at the library of congress international symposium on science of science, Washington, DC, USA. http://hdl.handle.net/2142/88927.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data. https://doi.org/10.1145/1552303.1552304.
Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 39–48).
Vogel, T., Heise, A., Draisbach, U., Lange, D., & Naumann, F. (2014). Reach for gold: An annealing standard to evaluate duplicate detection results. Journal of Data and Information Quality, 5(1–2), 1–25. https://doi.org/10.1145/2629687.
Wang, J., Berzins, K., Hicks, D., Melkers, J., Xiao, F., & Pinheiro, D. (2012). A boosted-trees method for name disambiguation. Scientometrics, 93(2), 391–411. https://doi.org/10.1007/s11192-012-0681-1.
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active name disambiguation. In Paper presented at the 2011 IEEE 11th international conference on data mining. http://ieeexplore.ieee.org/document/6137284/.
Whang, S. E., Menestrina, D., Koutrika, G., Theobald, M., & Garcia-Molina, H. (2009). Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD international conference on management of data.

Acknowledgements

This work was supported by grants from the National Science Foundation (#1561687 and #1535370), the Alfred P. Sloan Foundation, and the Ewing Marion Kauffman Foundation. We would like to thank the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Institute for Research on Innovation and Science, Survey Research Center, Institute for Social Research, University of Michigan, 330 Packard Street, Ann Arbor, MI, 48104-2910, USA
Jinseok Kim
School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL, 61820-6211, USA
Jinmo Kim
Department of Sociology, Institute for Social Research, University of Michigan, 330 Packard Street, Ann Arbor, MI, 48104-2910, USA
Jason Owen-Smith

Corresponding author

Correspondence to Jinseok Kim.

Appendices

Appendix A: Construction of self-citation relation

If a paper cites another paper, the two are in a citing-cited relation. From this paper-level citation information, scholars have constructed author-level citation relations. In Fig. 6, Author A and Author B coauthor Paper 1, while Author C and Author D write Paper 2 together. If Paper 2 cites Paper 1 (paper-level citation), the authors of Paper 2 are assumed to cite the authors of Paper 1. Thus, Author C is depicted as citing Author A and Author B, and Author D as citing Author A and Author B. If Author C is the same person as Author A, the two are in a self-citation relation.
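The expansion from paper-level to author-level citation can be sketched as below. This is an illustrative sketch only: the function names and the data layout are assumptions, and because the appendix does not say how two name instances are judged to be the same author, that decision is left to a caller-supplied predicate (here a hypothetical identity lookup in which Author C and Author A are the same person).

```python
def author_citations(papers, citations):
    """Expand paper-level citations into author-level citation links.

    papers:    dict mapping paper id -> list of author name instances
    citations: iterable of (citing_paper, cited_paper) pairs
    Returns (citing_paper, citing_author, cited_paper, cited_author) tuples.
    """
    links = []
    for citing, cited in citations:
        for citing_author in papers[citing]:
            for cited_author in papers[cited]:
                links.append((citing, citing_author, cited, cited_author))
    return links

def self_citations(links, same_author):
    """Keep only links whose citing and cited authors are the same entity."""
    return [link for link in links if same_author(link[1], link[3])]

# The example of Fig. 6: Paper 2 (Authors C, D) cites Paper 1 (Authors A, B).
papers = {
    "Paper 1": ["Author A", "Author B"],
    "Paper 2": ["Author C", "Author D"],
}
citations = [("Paper 2", "Paper 1")]

# Hypothetical ground truth: Author C and Author A are one person.
identity = {"Author A": "p1", "Author B": "p2", "Author C": "p1", "Author D": "p3"}

links = author_citations(papers, citations)
self_links = self_citations(links, lambda x, y: identity[x] == identity[y])
```

One paper-level citation yields four author-level links here, of which only the link from Author C to Author A is a self-citation.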

Appendix B: Representativeness checks for ORCID-linked data

This section checks how well the ORCID-linked data (Methodology > Data and Pre-processing > ORCID-Linkage) represent the whole data in this study. Following the method described in Representativeness Checks of Results, the ratios of name ethnicity and block size in the ORCID-linked data are compared to those in the whole data. Figure 7 shows that in the ORCID-linked data, Chinese names are under-represented and Hispanic and Italian names are over-represented, while names of other ethnicities show similar ratios. This observation contrasts with that from Fig. 1, where Chinese names are slightly over-represented and English names are slightly under-represented.

Regarding the block size distribution in Fig. 8, the distribution plot of ORCIDs starts higher on the y-axis (= ratio) than that of Random Data but falls below it as the x-value (= block size) increases. This means that the ORCID-linked data contain more small blocks and fewer large blocks than a randomly selected subset with the same number of name instances, whereas the automatically labeled data produce a block size distribution quite similar to that of random data in Fig. 2.
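A comparison of this kind can be sketched as follows. The sketch makes several assumptions that are not stated in this appendix: each name instance carries a hypothetical block key, a block's size is the number of instances sharing that key, and the ratio for a given size is the share of blocks having that size. The same ratio function can be applied to ethnicity labels.

```python
from collections import Counter

def ratio_distribution(values):
    """Share of each distinct value among all values, keyed in sorted order."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: counts[value] / total for value in sorted(counts)}

def block_sizes(block_keys):
    """Size of every block, given one block key per name instance."""
    return list(Counter(block_keys).values())

# Invented block keys, one per name instance (hypothetical data).
whole = ["kim_j"] * 4 + ["lee_s"] * 2 + ["newman_m"] + ["park_h"]
subset = ["kim_j"] * 2 + ["lee_s"] + ["newman_m"]

whole_dist = ratio_distribution(block_sizes(whole))
subset_dist = ratio_distribution(block_sizes(subset))

# Per-size difference; a positive value means the size is over-represented
# in the subset relative to the whole data.
difference = {
    size: subset_dist.get(size, 0.0) - whole_dist.get(size, 0.0)
    for size in sorted(set(whole_dist) | set(subset_dist))
}
```

In this toy case the subset has a larger share of size-1 blocks and no size-4 block, which is the same qualitative pattern the text describes for the ORCID-linked data.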

About this article

Cite this article

Kim, J., Kim, J. & Owen-Smith, J. Generating automatically labeled data for author name disambiguation: an iterative clustering method. Scientometrics 118, 253–280 (2019). https://doi.org/10.1007/s11192-018-2968-3

Received: 26 June 2018
Published: 29 November 2018
Issue Date: 15 January 2019
DOI: https://doi.org/10.1007/s11192-018-2968-3
